Hacker News with Generative AI: Performance Optimization

Show HN: AutoThink – Boosts local LLM performance with adaptive reasoning (ycombinator.com)
I built AutoThink, a technique that makes local LLMs reason more efficiently by adaptively allocating computational resources based on query complexity.
Look Ma, No Bubbles: Designing a Low-Latency Megakernel for Llama-1B (hazyresearch.stanford.edu)
There are some applications that benefit from running LLMs really, really fast. This low-latency regime encompasses applications like chatbots and human-in-the-loop workflows, where users care a lot about seeing responses come back immediately.
Python Pandas Ditches NumPy for Speedier PyArrow (thenewstack.io)
Unlocking Ractors: class instance variables in Ruby (byroot.github.io)
In a previous post about ractors, I explained why I think it’s really unlikely you’d ever be able to run an entire application inside a ractor, but that they could still be situationally very useful to move CPU-bound work out of the main thread, and to unlock some parallel algorithms.
Improving performance of original dav1d video decoder (videolan.org)
I noticed a very clickbait bounty and quickly realized that the company's real goal was not to overtake the existing implementation, but to advertise that Rust is only 5% slower than C. Whether it actually pays out or not is another matter. The main thing for Prossimo was to make a fuss about the current rav1d implementation being only 5% slower, so that the general public would think the two languages are equally fast.
Faster Firewalls with Bpfilter (lwn.net)
From servers in a data center to desktop computers, many devices communicating on a network will eventually have to filter network traffic, whether it's for security or performance reasons. As a result, this is a domain where a lot of work is put into improving performance: a tiny performance improvement can have considerable gains.
Accelerating Docker Builds by Halving EC2 Boot Time (depot.dev)
We at Depot like making shit fast, whether that's Docker image builds, GitHub Actions runners, Bazel caching, Turborepo, or even our own infrastructure.
Whippet GC notes on Guile, heuristics, and heap growth (wingolog.org)
Greets all! Another brief note today. I have gotten Guile working with one of the Nofl-based collectors, specifically the one that scans all edges conservatively (heap-conservative-mmc / heap-conservative-parallel-mmc). Hurrah!
Slack, Notion, and VSCode Improved Electron App Performance (palette.dev)
Leading the development of electron-react-boilerplate for over a decade has taught me a lot about bottlenecks in Electron apps and how to work around them. Properly engineered, Electron apps can closely rival the performance of native apps. This post is a complete guide on exploiting every Electron performance optimization I know so that you can get the most mileage.
More than you ever wanted to know about font loading on the web (2021) (industrialempathy.com)
When I started thinking about writing a post about web font loading my intention was to propose relatively sophisticated ideas that I've been playing with for a while. However, as I was trying to use them in real-world websites I realized that deployment of the more advanced techniques is de-facto impossible without the creation of new web standards.
FUSE to Enjoy a Performance Improvement with Linux 6.16 (phoronix.com)
Queued up via the FUSE "for-next" Git branch ahead of the upcoming Linux 6.16 merge window is a change to increase the read directory buffer size to in turn enhance the performance.
Understanding the Go Scheduler (nghiant3223.github.io)
Understanding the Go scheduler is crucial for Go programmers who want to write efficient concurrent programs. It also helps us become better at troubleshooting performance issues or tuning the performance of our Go programs. In this post, we will explore how the Go scheduler evolved over time, and how the Go code we write runs under the hood.
SQL OFFSET is worse than keyset pagination (use-the-index-luke.com)
After implementing a pipelined top-N query to retrieve the first page efficiently, you will often also need another query to fetch the next pages. The resulting challenge is that it has to skip the rows from the previous pages.
Precomputing Transparency Order in 3D (jacobdoescode.com)
Transparency — or more precisely, translucency — remains a problem when rendering in 3D. When you have translucent shapes, the order in which they get rendered is very important. Consider what happens if this is done incorrectly.
Jetrelay: A high-performance ATproto relay in 500 LOC (asayers.com)
This post explains the design of jetrelay, a pub/sub server compatible with Bluesky’s “jetstream” data feed. Using a few pertinent Linux kernel features, it avoids doing almost any work itself. As a result, it’s highly efficient: it can saturate a 10 Gbps network connection with just 8 CPU cores.
USE Method: Linux Performance Checklist (brendangregg.com)
The USE Method provides a strategy for performing a complete check of system health, identifying common bottlenecks and errors. For each system resource, metrics for utilization, saturation and errors are identified and checked. Any issues discovered are then investigated using further strategies.
Binary Formats Are Better Than JSON in Browsers (adamfaulkner.github.io)
JSON used to be faster than alternatives in browsers, but that's not the case anymore. For performance sensitive web apps, it is worth considering Avro, Protobuf, or Bebop.
Show HN: LoopMix128 – Fast C PRNG (.46ns), 2^128 Period, BigCrush/PractRand Pass (github.com/danielcota)
This repository contains LoopMix128, an extremely fast pseudo-random number generator (PRNG) with a guaranteed period of 2^128, proven injectivity, and clean passes in both BigCrush and PractRand (32TB). It is designed for non-cryptographic applications where speed and statistical quality are important.
21 GB/s CSV Parsing Using SIMD on AMD 9950X (nietras.com)
Sep 0.10.0 was released April 22nd, 2025 with optimizations for AVX-512 capable CPUs like the AMD 9950X (Zen 5) and updated benchmarks including the 9950X. Sep now achieves a staggering 21 GB/s on the 9950X for the low-level CSV parsing. 🚀 Before 0.10.0, Sep achieved ~18 GB/s on 9950X.
Implementing a Struct of Arrays (brevzin.github.io)
Recently, I watched Andrew Kelley’s talk on Practical Data Oriented Design. It goes into some of the architectural changes he’s been making to the Zig compiler, with pretty significant performance benefit. Would definitely recommend checking out the talk, even if you’re like me and have never written any Zig.
V8 JavaScript engine gets eager compilation hints (devclass.com)
The V8 JavaScript engine, used by the Chrome web browser, Node.js and elsewhere, has a new feature which lets developers mark a file for early compilation, with strong benefits for load time provided the option is used sparingly.
QUIC restarts, slow problems: udpgrm to the rescue (cloudflare.com)
At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.
Inheritance was invented as a performance hack (2021) (catern.com)
Inheritance was invented by the Simula language as a way to support intrusive lists, save memory, and simplify the garbage collector.
Critical CSS (kigo.studio)
Another look into PostgreSQL CTE materialization and non-idempotent subqueries (shayon.dev)
A few days ago, I wrote about a surprising planner behavior with CTEs, DELETE, and LIMIT in PostgreSQL, a piece I hastily put together on a bus ride.
Distributed Continuous GPU Profiling (zymtrace.com)
Identify performance bottlenecks in CUDA kernels, optimize inference batch size, and eliminate idle GPU cycles, with zero friction.
Making PyPI's test suite 81% faster (trailofbits.com)
Trail of Bits has collaborated with PyPI for several years to add features and improve security defaults across the Python packaging ecosystem.
Optimizing eBPF I/O latency accounting when running 37M IOPS, on 384 CPUs (tanelpoder.com)
In this post I will introduce a much more efficient method for accounting block I/O latencies with eBPF on Linux.
Dataframely: A polars-native data frame validation library (quantco.com)
At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they remain easily maintainable. More recently, this often involved migrating data pipelines from pandas to polars in order to achieve significant performance gains.