Hacker News with Generative AI: Performance Optimization

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch (arxiv.org)
CUDA Graphs -- a recent hardware feature introduced for NVIDIA GPUs -- aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data copy. In fact, we show a counter-intuitive result -- deploying CUDA Graphs hurts performance in many cases.
Hyperscaling Have I Been Pwned with Cloudflare Workers and Caching (troyhunt.com)
I've spent more than a decade now writing about how to make Have I Been Pwned (HIBP) fast. Really fast. Fast to the extent that sometimes, it was even too fast:
Building a Fast, SIMD/GPU-Friendly Random Number Generator for Fun and Profit (vectrx.substack.com)
When writing shaders, SIMD code, or GPU kernels, one typically doesn’t need a cryptographically secure random number generator — something fast and statistically decent is often good enough.
Coding Neon Kernels for the Cortex-A53 (destevez.net)
Some weeks ago, I presented at FOSDEM my work-in-progress high performance SDR runtime qsdr. I showed a hand-written NEON assembly implementation of a kernel that computes \(y[n] = ax[n] + b\), which I used as the basic math block for benchmarks on a Kria KV260 board (which has a quad-core ARM Cortex-A53 at 1.33 GHz). In that talk I glossed over the details of how I implemented this NEON kernel.
Achieving lower latencies with S3 object storage (spiraldb.com)
Over the past 19 years (S3 was launched on March 14th 2006, as the first public AWS service), object storage has become the gold standard for storing large amounts of data in the cloud. It's reliable, reasonably cheap, reasonably fast, and requires no special incantations to deploy. Best of all, it offers a straightforward HTTP-based interface with clear semantics (see NFS horrors).
Show HN: K(r)ep - A high-performance string search utility (github.com/davidesantangelo)
Raw Loops for Performance? (sandordargo.com)
To my greatest satisfaction, I’ve recently joined a new project. I started to read through the codebase before joining and at that stage, whenever I saw a possibility for a minor improvement, I raised a tiny pull request. One of my pet peeves is rooted in Sean Parent’s 2013 talk at GoingNative, Seasoning C++ where he advocated for no raw loops.
New Apache Cassandra Release Saves 400% IOPS (simplyblock.io)
On April 10, 2025, the Apache Software Foundation released version 5.0.4 of Apache Cassandra, bringing significant performance optimizations for all users—but especially for those relying on remotely attached storage like Amazon EBS. The standout feature in this release is an overhaul of the compaction algorithm aimed at slashing IOPS usage while increasing overall throughput.
Battle of the Mallocators (blogspot.com)
If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here).
Vertical Sharding Sucks (pgdog.dev)
Vertical sharding, sometimes called functional sharding, takes tables out of your main database and puts them somewhere else. Most of the time, it’s another Postgres database. This reduces load on the main DB and gives your app some breathing room to grow.
Container CPU requests and limits explained with GOMAXPROCS tuning (victoriametrics.com)
In this article, we’re going to cover a few things that might’ve puzzled you if you’ve been running your applications, especially Go applications, in Kubernetes:
We Chose Tauri over Electron for Our Performance-Critical Desktop App (gethopp.app)
At Hopp, we're building a cross-platform remote control application designed for a low-latency remote pair programming experience. Providing the best possible user experience is our top priority.
ClickHouse Denormalization is not the answer to slow JOINs (glassflow.dev)
Reducing query times and optimizing performance are critical goals when working with ClickHouse, a fast, open-source, columnar database.
EngFlow Makes C++ Builds 21x Faster and Software a Lot Safer (thenewstack.io)
Faster Shuffling in Go with Batching (lemire.me)
Go’s rand.Shuffle is a solid baseline, but batching random integer generation can make it much faster. By generating multiple random numbers from a single 64-bit value, we can boost efficiency—by over 2x in our benchmarks.
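To illustrate the batching idea in the excerpt above, here is a minimal sketch of a Fisher-Yates shuffle that derives two swap indices from a single 64-bit random word. It is written in Python rather than Go for brevity and is not the article's code: the names `two_indices` and `batched_shuffle` are hypothetical, and the multiply-and-reject scheme is an assumption about how such batching can be kept unbiased.

```python
import random

MASK64 = (1 << 64) - 1


def two_indices(rng, n1, n2):
    """Draw i1 in [0, n1) and i2 in [0, n2) from one 64-bit random word.

    Assumes n1 * n2 <= 2**64. A rare rejection step keeps both indices unbiased.
    """
    product = n1 * n2
    while True:
        r = rng.getrandbits(64)
        m = r * n1                       # 128-bit product: high word is i1
        i1, low = m >> 64, m & MASK64
        m = low * n2                     # reuse the low word for the second index
        i2, low = m >> 64, m & MASK64
        if low >= product:               # fast path: no modulo needed
            return i1, i2
        if low >= (1 << 64) % product:   # exact check against the bias threshold
            return i1, i2
        # otherwise reject this word and draw another


def batched_shuffle(items, rng=None):
    """Shuffle a list in place, performing two Fisher-Yates steps per random word."""
    if rng is None:
        rng = random.Random()
    i = len(items) - 1
    while i >= 2:
        j1, j2 = two_indices(rng, i + 1, i)
        items[i], items[j1] = items[j1], items[i]
        items[i - 1], items[j2] = items[j2], items[i - 1]
        i -= 2
    if i == 1:                           # one leftover step when the count is even
        j = rng.randrange(2)
        items[1], items[j] = items[j], items[1]
    return items
```

Calling `batched_shuffle(list(range(10)))` returns the same list object, permuted. The sketch stops at two indices per word; extracting more per word follows the same pattern as long as the product of the bounds fits in 64 bits.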
Making OCaml Safe for Performance Engineering [video] (youtube.com)
Show HN: uWrap.js – A faster and more accurate text wrapping util in < 2KB (github.com/leeoniya)
uWrap exists to efficiently predict varying row heights for list and grid virtualization, a technique for UI performance optimization when rendering large, scrollable datasets.
Engineering a Trace Details Page That Handles a Million Spans (signoz.io)
Golang sync.Pool is not a silver bullet (wundergraph.com)
When it comes to performance optimization in Go, sync.Pool often appears as a tempting solution. It promises to reduce memory allocations and garbage collection pressure by reusing objects. But is it always the right choice? Let's dive deep into this fascinating topic.
Growing Buffers to Avoid Copying Data (johnnysswlab.com)
We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
Why Adding a Full Hard Drive Can Make a Computer More Powerful (wired.com)
“Obviously” is a dangerous word, even in scenarios that seem simple. Suppose, for instance, you need to do an important computation. You get to choose between two computers that are almost identical, except that one has an extra hard drive full of precious family photos. It’s natural to assume that the two options are equally good—that an extra drive with no space remaining won’t aid your computation.
Ubuntu Provides More Insight into Their Decision Not to "-O3" All Packages (phoronix.com)
Since last year, Canonical has been investigating the use of -O3 compiler optimizations for its Ubuntu package builds in the name of delivering better performance for Ubuntu Linux.
Disk I/O bottlenecks in GitHub Actions (depot.dev)
When your CI pipelines are slow, you can only optimize so much. Bottlenecks in CPU, Network, Memory, and Disk I/O can all contribute to slow CI pipelines. Let's take a look at how disk I/O can be a bottleneck in GitHub Actions.
Faster interpreters in Go: Catching up with C++ (planetscale.com)
The SQL evaluation engine that ships with Vitess, the open-source database that powers PlanetScale, was originally implemented as an AST evaluator that used to operate directly on the SQL AST generated by our parser. Over this past year, we've gradually replaced it with a Virtual Machine which, despite being written natively in Go, performs similarly to the original C++ evaluation code in MySQL.
The Curious Case of Beam CPU Usage (2019) (stressgrid.com)
While benchmarking Go vs Elixir vs Node, we discovered that Elixir (running on the BEAM virtual machine) had much higher CPU usage than Go, and yet its responsiveness remained excellent. Some of our readers suggested that busy waiting may be responsible for this behavior.
Prospero challenge, now with more garbage collection (bernsteinbear.com)
Matt Keeter put up The Prospero Challenge, which is like catnip for me. It’s a well-scoped project: we have a slow program. Make it faster within these constraints. In this post, I will describe two very small changes that can speed up his sample program with minimal effort.
Btrfs Adding Fast/Realtime ZSTD Compression and Other Performance Optimizations (phoronix.com)
David Sterba of SUSE sent in all of the Btrfs file-system updates today for the now-open Linux 6.15 kernel merge window.
Fast columnar JSON decoding with arrow-rs (arroyo.dev)
JSON is the most common serialization format used in streaming pipelines, so it pays to be able to deserialize it fast. This post covers in detail how the arrow-json library works to perform very efficient columnar JSON decoding, and the additions we've made for streaming use cases.
Optimizing by 1700x by not being silly (ayende.com)
I care about the performance of RavenDB. Enough that I would go to epic lengths to fix performance problems. Here I use “epic” both in terms of the Agile meaning of multi-month journeys and the actual amount of work required. See my recent posts about RavenDB 7.1 I/O work.