Hacker News with Generative AI: Performance Optimization

PostgreSQL Lands Self-Join Elimination Optimization (phoronix.com)
More than seven years in the making, a self-join elimination (SJE) feature was merged into PostgreSQL yesterday as a performance optimization for some queries.
Show HN: Pulse – Maintain healthy OpenSearch and Elasticsearch clusters (pulse.support)
Pulse puts you in control of your search cluster monitoring and maintenance. Get more clarity, better performance, and lower costs.
Nginx: try_files Is Evil Too (2024) (getpagespeed.com)
0+0 > 0: C++ thread-local storage performance (yosefk.com)
We'll discuss how to make sure that your access to TLS (thread-local storage) is fast. If you’re interested strictly in TLS performance guidelines and don't care about the details, skip right to the end — but be aware that you’ll be missing out on assembly listings of profound emotional depth, which can shake even a cynical, battle-hardened programmer. If you don’t want to miss out on that — and who would?!
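One thing that typically dominates the cost of a thread_local access is the TLS model the compiler has to assume for the variable. Below is a minimal sketch of that knob (hypothetical variable names, not necessarily the article's example); built into a shared library with -O2 -fPIC, the default model may go through a call to __tls_get_addr, while initial-exec compiles down to a plain offset-based load.

```cpp
#include <cstdint>

// Default model: in a shared library built with -fPIC this access may be
// routed through a call to __tls_get_addr.
thread_local uint64_t counter_default = 0;

// Initial-exec model: a direct, offset-based access (e.g. %fs-relative on
// x86-64 Linux), at the cost of restrictions on how the .so can be loaded.
// The same effect can be requested globally with -ftls-model=initial-exec.
thread_local uint64_t counter_ie __attribute__((tls_model("initial-exec"))) = 0;

uint64_t bump_default() { return ++counter_default; }
uint64_t bump_ie()      { return ++counter_ie; }
```

Comparing the generated assembly for the two bump functions (e.g. via the compiler's -S output) shows the difference in access cost.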
A tail calling interpreter for Python (already landed in CPython) (reverberate.org)
It’s been nearly four years since I published Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C. In that article, I presented a technique I co-developed for writing really fast interpreters using tail calls and the musttail attribute.
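The pattern is simple to sketch: every opcode handler gets the same signature, and each handler ends by tail-calling the handler for the next opcode, with musttail guaranteeing the compiler emits a jump rather than a call. A minimal, hypothetical toy interpreter in that style (clang-only, using [[clang::musttail]]; not the protobuf parser or the CPython change):

```cpp
#include <cstdint>
#include <cstdio>

struct VM;
using Handler = void (*)(VM&, const uint8_t* ip);   // one shared signature

struct VM {
    int64_t acc;
    const Handler* table;                           // one handler per opcode
};

static void op_add1(VM& vm, const uint8_t* ip) {
    vm.acc += 1;
    // Guaranteed tail call: dispatch stays a jump, state stays in registers.
    [[clang::musttail]] return vm.table[ip[1]](vm, ip + 1);
}

static void op_halt(VM& vm, const uint8_t*) {
    std::printf("acc = %lld\n", (long long)vm.acc);
}

static void run(VM& vm, const uint8_t* ip) {
    [[clang::musttail]] return vm.table[*ip](vm, ip);
}

int main() {
    static const Handler table[] = {op_halt, op_add1};  // opcode 0 = halt, 1 = add1
    VM vm{0, table};
    const uint8_t code[] = {1, 1, 1, 0};                // add1, add1, add1, halt
    run(vm, code);                                      // prints acc = 3
}
```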
Low-Latency Transaction Scheduling via Userspace Interrupts [pdf] (cs.sfu.ca)
Google Lighthouse recommends not embedding YouTube videos directly (chrome.com)
Third-party resources are often used for displaying ads or videos and integrating with social media. The default approach is to load third-party resources as soon as the page loads, but this can unnecessarily slow the page load. If the third-party content is not critical, this performance cost can be reduced by lazy loading it.
PgAssistant: OSS tool to help devs understand and optimize PG performance (github.com/nexsol-technologies)
PgAssistant is an open-source tool designed to help developers understand and optimize their PostgreSQL database performance.
Show HN: Seen – Virtual list rendering with 1M+ notes (vercel.app)
Gh-128563: A new tail-calling interpreter for Python 3.14 (github.com/python)
How to GIF (2025 Edition) (fullystacked.net)
Back in 2022 I published the article GIFs Without the .gif: The Most Performant Image and Video Options Right Now on CSS Tricks. Certain information in that post is now out of date:
Profiling in production with function call traces (yosefk.com)
A timeline showing function call and return events is a great way to debug performance problems, especially in production. In particular, it's often much more effective than traditional sampling profilers, for reasons we’ll discuss. However, the adoption of function tracing in the industry remains uneven because of a chicken-and-egg problem.
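On Linux, the simplest way to experiment with call/return events is GCC/Clang's -finstrument-functions, which injects a call to two hooks at every function entry and exit. A bare-bones sketch of such hooks (a real tracer would log timestamps into a ring buffer instead of printing; this is not the tracer from the article):

```cpp
// Build: g++ -O0 -finstrument-functions trace.cpp
#include <cstdio>

extern "C" {

// The hooks themselves must not be instrumented, or they would recurse.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* fn, void* call_site) {
    std::fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* fn, void* call_site) {
    std::fprintf(stderr, "exit  %p (called from %p)\n", fn, call_site);
}

} // extern "C"

static int work(int n) { return n * n; }

int main() { return work(7) == 49 ? 0 : 1; }
```

The raw function addresses can be resolved to names after the fact, e.g. with dladdr() or addr2line.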
Chinese algorithm claimed to boost Nvidia GPU performance by up to 800X (tomshardware.com)
Setting up a Linux writecache as a RAM disk (2019) (admin-magazine.com)
Kicking write I/O operations into overdrive with the Linux device mapper writecache.
Llama.cpp PR with 99% of code written by DeepSeek-R1 (github.com/ggerganov)
This PR provides a big jump in speed for WASM by leveraging SIMD instructions for qX_K_q8_K and qX_0_q8_0 dot product functions.
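For reference, this is what vectorization with the WASM SIMD128 intrinsics looks like, using a plain f32 dot product as a stand-in; the actual PR rewrites llama.cpp's quantized qX_K_q8_K and qX_0_q8_0 kernels, which are considerably more involved.

```cpp
// Build for WebAssembly with SIMD enabled, e.g. `emcc -O3 -msimd128`.
#include <wasm_simd128.h>
#include <cstddef>

float dot_f32(const float* a, const float* b, size_t n) {
    v128_t acc = wasm_f32x4_splat(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v128_t va = wasm_v128_load(a + i);
        v128_t vb = wasm_v128_load(b + i);
        acc = wasm_f32x4_add(acc, wasm_f32x4_mul(va, vb));  // 4 multiply-adds per iteration
    }
    float sum = wasm_f32x4_extract_lane(acc, 0) + wasm_f32x4_extract_lane(acc, 1)
              + wasm_f32x4_extract_lane(acc, 2) + wasm_f32x4_extract_lane(acc, 3);
    for (; i < n; ++i) sum += a[i] * b[i];                   // scalar tail
    return sum;
}
```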
WebFFT – The Fastest Fourier Transform on the Web (github.com/IQEngine)
The Fastest Fourier Transform on the Web!
The Mythical IO-Bound Rails App (byroot.github.io)
When the topic of Rails performance comes up, it is commonplace to hear that the database is the bottleneck, so Rails applications are IO-bound anyway, hence Ruby performance doesn’t matter that much, and all you need is a healthy dose of concurrency to make your service scale.
Making the fastest phrase search algo with the most unhinged AVX512 instruction (gab-menezes.github.io)
For those who don’t want to read/don’t care that much, here are the results. I hope after seeing them you are compelled to read. TL;DR: I wrote a super fast phrase search algorithm using AVX-512 and achieved wins up to 1600x the performance of Meilisearch.
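The instruction in question is presumably AVX512-VP2INTERSECT, which intersects two vectors of sixteen 32-bit values in one shot; phrase search reduces to exactly that kind of work, since the positions of the second word must equal the positions of the first word plus one. A rough, hypothetical sketch of the core step (sorted position lists, lengths padded to multiples of 16, document boundaries ignored; not the article's code, which is far smarter than this quadratic scan):

```cpp
// Requires -mavx512f -mavx512vp2intersect.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>
#include <vector>

// Return positions p from `a` such that p+1 appears in `b`
// ("word B occurs immediately after word A").
std::vector<uint32_t> phrase_hits(const uint32_t* a, size_t na,
                                  const uint32_t* b, size_t nb) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i < na; i += 16) {
        // Shift A's positions by +1 so the problem becomes a plain intersection.
        __m512i va = _mm512_add_epi32(_mm512_loadu_si512(a + i), _mm512_set1_epi32(1));
        for (size_t j = 0; j < nb; j += 16) {
            __m512i vb = _mm512_loadu_si512(b + j);
            __mmask16 ka, kb;
            _mm512_2intersect_epi32(va, vb, &ka, &kb);          // lanes present in both
            __m512i hits = _mm512_maskz_compress_epi32(ka, va); // pack matches to the front
            alignas(64) uint32_t buf[16];
            _mm512_store_si512(buf, hits);
            int count = _mm_popcnt_u32(ka);
            for (int k = 0; k < count; ++k) out.push_back(buf[k] - 1);
        }
    }
    return out;
}
```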
Checking whether an ARM NEON register is zero (lemire.me)
Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, they power cloud servers at both Amazon AWS and Microsoft Azure, they power fast laptops, and so forth.
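One well-known way to test a NEON register for zero on 64-bit ARM (a sketch of a common idiom, not necessarily the approach the post recommends) is a horizontal max followed by a scalar compare:

```cpp
#include <arm_neon.h>

// True iff all 128 bits of the NEON register are zero.
// vmaxvq_u32 (UMAXV) reduces the four 32-bit lanes to their maximum,
// which is zero only when every byte of the vector is zero.
bool is_zero(uint8x16_t v) {
    return vmaxvq_u32(vreinterpretq_u32_u8(v)) == 0;
}
```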
Writing High Performance F# code (bartoszsypytkowski.com)
While this post is addressed to F# .NET developers, it introduces much broader concepts, from hardware architecture to .NET runtime and JIT compiler optimizations. This shouldn't be a surprise: optimizing application performance requires us to understand the relationship between our high-level code and what actually happens on the hardware.
Show HN: Billion Cell Spreadsheets with Incremental Computation (feldera.io)
Train faster static embedding models with sentence transformers (huggingface.co)
This blog post introduces a method to train static embedding models that run 100x to 400x faster on CPU than state-of-the-art embedding models, while retaining most of the quality. This unlocks a lot of exciting use cases, including on-device and in-browser execution, edge computing, low power and embedded applications.
Beating cuBLAS in Single-Precision General Matrix Multiplication (salykova.github.io)
This project is inspired by the outstanding works of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao and the best CUDA hackers from the GPU MODE community (Discord server). A special thanks to Mark Saroufim and Andreas Köpf for running GPU MODE and all you’ve done for the community.
Fedora 42 Looks to Ship Optimized Executables for Different x86_64 Capabilities (phoronix.com)
Fedora Linux already supports glibc HWCAPs to allow libraries to be built for different x86_64 micro-architecture feature levels, which pays off for performance-sensitive code that can leverage AVX/AVX2 or other newer Intel/AMD CPU instruction set extensions. For Fedora 42, there is now a proposal to extend this so that binary executables can also leverage glibc HWCAPs for better performance.
CSSWind: Bloat-Free Component Styling (xeiaso.net)
What you need when even HTMX is too much.
YJIT 3.4: Even Faster and More Memory-Efficient (railsatscale.com)
It’s 2025, and this year again, the YJIT team brings you a new version of YJIT that is even faster, more stable, and more memory-efficient.
Double-keyed caching: Browser cache partitioning (addyosmani.com)
The web’s caching model served us well for over two decades. Recently, in the name of privacy, it’s undergone a fundamental shift that challenges many of our performance optimization assumptions. This is called Double-keyed Caching or cache-partitioning more generally. Here’s what changed, why it matters, and how to adapt.
Expressive Vector Engine – SIMD in C++ (github.com/jfalcou)
EVE is a re-implementation of the old EVE SIMD library by Falcou et al., which for a while was named Boost.SIMD. It's a C++20-and-onward implementation of a type-based wrapper around SIMD extension sets for most current architectures. It aims at showing how C++20 can be used to design and implement an efficient, low-level, high-abstraction library suited for high performance.
Breaking Up with Long Tasks or: how I learned to group loops and wield the yield (perfplanet.com)
Arrays are in every web developer’s toolbox, and there are a dozen ways to iterate over them. Choose wrong, though, and all of that processing time will happen synchronously in one long, blocking task. The thing is, the most natural ways are the wrong ways. A simple for..of loop that processes each array item is synchronous by default, while Array methods like forEach and map can ONLY run synchronously.