Hacker News with Generative AI: Performance Optimization

Checking whether an ARM NEON register is zero (lemire.me)
Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, they power cloud servers at both Amazon AWS and Microsoft Azure, they power fast laptops, and so forth.
Writing High Performance F# code (bartoszsypytkowski.com)
While this post is addressed to F# .NET developers, it introduces much wider concepts starting from hardware architecture to overall .NET runtime and JIT compiler optimizations. It shouldn't be a surprise - optimizing the application performance requires us to understand the relationships between our high level code and what actually happens on the hardware.
Show HN: Billion Cell Spreadsheets with Incremental Computation (feldera.io)
Train faster static embedding models with sentence transformers (huggingface.co)
This blog post introduces a method to train static embedding models that run 100x to 400x faster on CPU than state-of-the-art embedding models, while retaining most of the quality. This unlocks a lot of exciting use cases, including on-device and in-browser execution, edge computing, low power and embedded applications.
Beating cuBLAS in Single-Precision General Matrix Multiplication (salykova.github.io)
This project is inspired by the outstanding works of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao and the best CUDA hackers from the GPU MODE community (Discord server). A special thanks to Mark Saroufim and Andreas Köpf for running GPU MODE and all you’ve done for the community.
Fedora 42 Looks to Ship Optimized Executables for Different x86_64 Capabilities (phoronix.com)
Fedora Linux already supports using glibc HWCAPs to let libraries be built for different x86_64 micro-architecture feature levels, which pays off for performance-sensitive code that can leverage AVX/AVX2 or other newer Intel/AMD CPU instruction set extensions. For Fedora 42 there is now a proposal to extend that further so that binary executables can also leverage glibc HWCAPs for better performance.
CSSWind: Bloat-Free Component Styling (xeiaso.net)
What you need when even HTMX is too much.
YJIT 3.4: Even Faster and More Memory-Efficient (railsatscale.com)
It’s 2025, and this year again, the YJIT team brings you a new version of YJIT that is even faster, more stable, and more memory-efficient.
Double-keyed caching: Browser cache partitioning (addyosmani.com)
The web’s caching model served us well for over two decades. Recently, in the name of privacy, it’s undergone a fundamental shift that challenges many of our performance optimization assumptions. This is called Double-keyed Caching or cache-partitioning more generally. Here’s what changed, why it matters, and how to adapt.
Expressive Vector Engine – SIMD in C++ (github.com/jfalcou)
EVE is a re-implementation of the old EVE SIMD library by Falcou et al., which for a while was named Boost.SIMD. It's a C++20 and onward implementation of a type-based wrapper around SIMD extension sets for most current architectures. It aims to show how C++20 can be used to design and implement an efficient, low-level, high-abstraction library suited for high performance.
Breaking Up with Long Tasks or: how I learned to group loops and wield the yield (perfplanet.com)
Arrays are in every web developer’s toolbox, and there are a dozen ways to iterate over them. Choose wrong, though, and all of that processing time will happen synchronously in one long, blocking task. The thing is, the most natural ways are the wrong ways. A simple for..of loop that processes each array item is synchronous by default, while Array methods like forEach and map can ONLY run synchronously.
MPTCP: Revolutionizing connectivity, one path at a time (cloudflare.com)
The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in RFCs documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time.
Speeding Up SQLite Inserts (julik.nl)
In my work I tend to reach for SQLite more and more. The type of work I find it useful for most these days is quickly amalgamating, dissecting, collecting and analyzing large data sets. As I have outlined in my Euruko talk on scheduling, a key element of the project was writing a simulator. That simulator outputs metrics - lots and lots of metrics, which resemble what our APM solution collects.
Postgres UUIDv7 and per-backend monotonicity (brandur.org)
An implementation for UUIDv7 was committed to Postgres earlier this month. These have all the benefits of a v4 (random) UUID, but are generated with a more deterministic order using the current time, and perform considerably better on inserts using ordered structures like B-trees.
Static search trees: faster than binary search (curiouscoding.nl)
In this post, we will implement a static search tree (S+ tree) for high-throughput searching of sorted data, as introduced on Algorithmica.
Optimizing Ruby's JSON, Part 4 (byroot.github.io)
In the previous post, we established that as long as ruby/json wasn’t competitive on micro-benchmarks, public perception wouldn’t change. Since what made ruby/json appear so bad on micro-benchmarks was its setup cost, we had to find ways to reduce it further.
Peephole optimizations: adding `opt_respond_to` to the Ruby VM, part 4 (jpcamara.com)
In The Ruby Syntax Holy Grail: adding opt_respond_to to the Ruby VM, part 3, I found what I referred to as the “Holy Grail” of Ruby syntax. I’m way overstating it, but it’s a readable, sequential way of viewing how a large portion of the Ruby syntax is compiled.
Subprocess: Don't close all file descriptors by default (close_fds=False) (python.org)
To make subprocess faster, I propose to no longer close all file descriptors by default in subprocess: change Popen close_fds parameter default to False (close_fds=False).
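A minimal sketch of the parameter under discussion, assuming a current CPython where `close_fds` still defaults to `True`. The helper name `run_child` is hypothetical; it only shows that the flag can already be set explicitly per call, and makes no claim about the speedup the proposal expects.

```python
import subprocess
import sys


def run_child(close_fds: bool) -> str:
    """Run a trivial child process and return its stdout.

    close_fds=True  : the current default; inherited descriptors other than
                      0, 1 and 2 are closed in the child before exec.
    close_fds=False : the default proposed in the linked discussion; the
                      descriptor-closing step is skipped.
    """
    result = subprocess.run(
        [sys.executable, "-c", "print('ok')"],
        close_fds=close_fds,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Both settings produce the same output for a child that does not
    # rely on any extra inherited descriptors.
    print(run_child(True))
    print(run_child(False))
```

Passing `close_fds=False` explicitly is only safe when the parent holds no descriptors that the child should not inherit.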
The intricacies of implementing memoization in Ruby (denisdefreyne.com)
In the never-ending quest to write code that is performant, we have many techniques at our disposal. One of those techniques is memoization (that's memoization, not memorization: there's no "r"!), which boils down to storing the results of expensive function calls, so that these expensive functions do not need to be called more than absolutely necessary.
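As a rough illustration of the basic idea, here is a sketch in Python rather than Ruby. It is not taken from the linked article, and the names `memoize`, `slow_square`, and `call_log` are hypothetical. It covers only the simplest case: one hashable argument, with no eviction or thread safety.

```python
from typing import Callable, Dict, List


def memoize(fn: Callable[[int], int]) -> Callable[[int], int]:
    """Wrap fn so that each distinct argument is computed only once."""
    cache: Dict[int, int] = {}

    def wrapper(n: int) -> int:
        if n not in cache:
            # First call for this argument: compute and store the result.
            cache[n] = fn(n)
        return cache[n]

    return wrapper


# Records every real invocation so the effect of the cache is observable.
call_log: List[int] = []


@memoize
def slow_square(n: int) -> int:
    """Stand-in for an expensive function."""
    call_log.append(n)
    return n * n


if __name__ == "__main__":
    slow_square(4)
    slow_square(4)
    # The underlying function ran once; the second call hit the cache.
    print(len(call_log))
```

Python's standard library offers `functools.cache` and `functools.lru_cache` for the same purpose. The hand-written version above just makes the storage step explicit.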
Native FlameGraphViewer (laladrik.xyz)
There is something in Rust Analyzer that I would like to fix. This requires understanding its interaction with Chalk. To find the starting point I ran Rust Analyzer with Linux Perf to get the tree of calls represented in a Flame Graph. The Flame Graph was so big that the browser took quite a few seconds to render it. The hover events were delayed. Nothing happened when I tried to open a frame of the graph.
Reads Causing Writes in Postgres (jesipow.com)
It is good practice to regularly inspect the statements running in the hot path of your Postgres instance. One way to do this is to examine the pg_stat_statements view, which shows various statistics about the SQL statements executed by the Postgres server.
Show HN: Bodo – high-performance compute engine for Python data processing (github.com/bodo-ai)
Bodo is a cutting edge compute engine for large scale Python data processing. Powered by an innovative auto-parallelizing just-in-time compiler, Bodo transforms Python programs into highly optimized, parallel binaries without requiring code rewrites, which makes Bodo 20x to 240x faster compared to alternatives!
Optimizing Ruby's JSON, Part 1 (byroot.github.io)
I was recently made maintainer of the json gem, and aside from fixing some old bugs, I focused quite a bit on its performance, so that it is now the fastest JSON parser and generator for Ruby on most benchmarks.
Valhalla – Java's Epic Refactor (inside.java)
Project Valhalla wants to heal the rift in Java’s type system between classes and primitives by introducing value classes, which “code like a class, work like an int” and offer a flat and dense memory layout.
In Search of a Faster SQLite (avi.im)
SQLite is already fast. But can we make it even faster? Researchers at the University of Helsinki and Cambridge began with this question and published a paper, “Serverless Runtime / Database Co-Design With Asynchronous I/O”. They demonstrate up to a 100x reduction in tail latency. These are my notes on the paper.
Algorithms for high performance terminal apps (textualize.io)
I've had the fortune of being able to work full-time on a FOSS project for the last three-plus years.
Fair Go vs. Elixir Benchmarks (github.com/antonputra)
The code previously used Jason.encode! but Jason.encode_to_iodata! should be preferred when the output is written to an IO device. This should increase performance and reduce memory usage. This is what frameworks such as Phoenix would have used by default.
My wish for VFS or filesystem level cgroup (v2) IO limits (utoronto.ca)
I wish Linux cgroups (v2 of course) had an option/interface that limited *filesystem* IO that you could do, read and/or write.
Turning Off Zen 4's Op Cache for Curiosity and Giggles (chipsandcheese.com)
CPUs start executing instructions by fetching those instruction bytes from memory and decoding them into internal operations (micro-ops). Getting data from memory and operating on it consumes power and incurs latency. Micro-op caching is a popular technique to improve on both fronts, and involves caching micro-ops that correspond to frequently executed instructions.