Hacker News with Generative AI: Optimization

Shortest-possible walking tour to 81,998 bars in South Korea (uwaterloo.ca)
We have solved a traveling salesman problem (TSP) to walk to 81,998 bars in South Korea.
Link-Time Optimization of Dynamic Casts in C++ Programs [pdf] (ist.utl.pt)
Pushing the Limits of LLM Quantization via the Linearity Theorem (arxiv.org)
Quantizing large language models has become a standard way to reduce their memory and computational costs.
How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024) (alexarmbr.github.io)
This post details my recent efforts to write an optimized matrix multiplication kernel in CUDA using tensor cores on an NVIDIA Tesla T4 GPU. The goal is to compute $D = \alpha * A * B + \beta * C$ as fast as possible. In this equation $D$, $A$, $B$ and $C$ are large matrices full of half-precision floating-point numbers, and $\alpha$, $\beta$ are constants. This problem is usually referred to as a Half-precision Generalized Matrix Multiply, or HGEMM for short.
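For orientation, the formula in the excerpt can be written as a naive reference routine. This is only a sketch of the arithmetic, not the linked post's CUDA kernel: it uses ordinary Python floats rather than half precision, and the function name `gemm` and the list-of-lists layout are assumptions made for illustration.

```python
def gemm(alpha, a, b, beta, c):
    """Return D = alpha * (A * B) + beta * C for matrices stored as lists of rows.

    Reference version only: plain Python floats, no tiling, no tensor cores.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Shape checks: A is m x k, B is k x n, C is m x n.
    assert all(len(row) == k for row in a)
    assert all(len(row) == n for row in b)
    assert len(c) == m and all(len(row) == n for row in c)

    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # dot product of row i of A and column j of B
            d[i][j] = alpha * acc + beta * c[i][j]
    return d
```

A routine like this is mainly useful as a correctness baseline to compare an optimized kernel against on small inputs.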
A Real-Time Algorithm for Non-Convex Powered Descent Guidance [pdf] (depts.washington.edu)
Efficient E-Matching for Super Optimizers (vortan.dev)
Modern theorem provers and optimizing compilers are built on an interesting concept: the ability to recognize when two things are equal, even if they look completely different.
Optimizing Heap Allocations in Go: A Case Study (dolthub.com)
Last month, a commit that was supposed to be a no-op refactor caused a 30% regression in sysbench's types_scan benchmark.
Everything You Need to Know About Incremental View Maintenance (materializedview.io)
Incremental view maintenance has been a hot topic lately.
Less Slow C++ (github.com/ashvardanian)
Learning how to write "Less Slow" code in C++ 20, C 99, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO
Cutting down Rust compile times from 30 to 2 minutes with one thousand crates (feldera.com)
By simply changing how we generate Rust code under the hood, we’ve made Feldera’s compile times scale with your hardware instead of fighting it. What used to take 30–45 minutes now compiles in under 3 minutes, even for complex enterprise-scale SQL.
How to Optimize Rust for Slowness: Inspired by New Turing Machine Results (medium.com)
Everyone talks about making Rust programs faster [1, 2, 3], but what if we pursue the opposite goal? Let’s explore how to make them slower — absurdly slower. Along the way, we’ll examine the nature of computation, the role of memory, and the scale of unimaginably large numbers.
Cutting Down Rust Compile Times with One Thousand Crates (feldera.com)
By simply changing how we generate Rust code under the hood, we’ve made Feldera’s compile times scale with your hardware instead of fighting it.
Four Kinds of Optimisation (2023) (tratt.net)
Premature optimisation might be the root of all evil, but overdue optimisation is the root of all frustration. No matter how fast hardware becomes, we find it easy to write programs which run too slow. Often this is not immediately apparent. Users can go for years without considering a program’s performance to be an issue before it suddenly becomes so — often in the space of a single working day.
Fibonacci Hashing: The Optimization That the World Forgot (probablydance.com)
I recently posted a blog post about a new hash table, and whenever I do something like that, I learn at least one new thing from my comments. In my last comment section Rich Geldreich talks about his hash table which uses “Fibonacci Hashing”, which I hadn’t heard of before.
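The excerpt names the technique without describing it. As commonly described, Fibonacci hashing multiplies the hash by 2^64 divided by the golden ratio and keeps the top bits of the 64-bit product as the table index. The sketch below assumes a power-of-two table size; the names `FIB_MULT_64` and `fib_index` are illustrative and not taken from the linked post.

```python
# 2**64 divided by the golden ratio, as an odd 64-bit integer (0x9E3779B97F4A7C15).
FIB_MULT_64 = 11400714819323198485
MASK_64 = (1 << 64) - 1


def fib_index(hash_value, bits):
    """Map a hash to a slot in a table of 2**bits entries (1 <= bits <= 64)."""
    product = (hash_value * FIB_MULT_64) & MASK_64  # emulate 64-bit wraparound
    return product >> (64 - bits)                   # keep the top `bits` bits
```

For example, consecutive inputs 0, 1, 2, 3 land in slots 0, 4, 1, 6 of an 8-slot table, so nearby keys are spread apart instead of filling adjacent slots.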
You might not need WebSockets (hntrl.io)
WebSockets are powerful tools that have become a fan-favorite for building realtime applications, but you might be using them for all the wrong reasons.
Structural Optimization of I-Beams via Typographical Analysis (researchgate.net)
Grappling with Infinity in Constraint Solvers (tuzz.tech)
Many constraint-satisfaction problems deal with infinity in some shape or form.
PostgreSQL Full-Text Search: Fast When Done Right (Debunking the Slow Myth) (vectorchord.ai)
You might have come across discussions or blog posts suggesting that PostgreSQL's built-in full-text search (FTS) struggles with performance compared to dedicated search engines or specialized extensions.
A surprising enum size optimization in the Rust compiler (jpfennell.com)
Enums are one of the most popular features in Rust. An enum is a type whose value is one of a specified set of variants.
Deleting multiplayer from the Unreal engine can save memory (larstofus.com)
Although the title of this article might suggest otherwise, I don’t dislike Unreal’s multiplayer features.
Dynamic Register Allocation on AMD's RDNA 4 GPU Architecture (chipsandcheese.com)
Modern GPUs often make a difficult tradeoff between occupancy (active thread count) and register count available to each thread.
Optimizing Matrix Multiplication (coffeebeforearch.github.io)
Matrix multiplication is an incredibly common operation across numerous domains. It is also known as being “embarrassingly parallel”. As such, one common optimization is parallelization across threads on a multi-core CPU or GPU. However, parallelization is not a panacea. Poorly parallelized code may provide minimal speedups (if any).
Using Token Sequences to Iterate Ranges (brevzin.github.io)
There was a StackOverflow question recently that led me to want to write a new post about Ranges. Specifically, I wanted to write about some situations in which Ranges do more work than it seems like they should have to. And then what we can do to avoid doing that extra work.
Journey to Optimize Cloudflare D1 Database Queries (github.com)
Recently, I've been working on server-side projects using Cloudflare Workers with D1 database. During this process, I encountered several database-related challenges. Since databases are quite unfamiliar territory for frontend developers, I decided to document my experiences.
The Fifth Kind of Optimisation (tratt.net)
A little while back I wrote about what I considered to be the four main kinds of optimisation:
Is Python Code Sensitive to CPU Caching? (2024) (lukasatkinson.de)
Cache-aware programming can make a huge performance difference, especially when writing code in C++ or Rust. Python is a much more high-level language, and doesn't give us that level of control over memory layout of our data structures. So does this mean that CPU caching effects aren't relevant in Python?
Show HN: Terminal dashboard that throttles my PC during peak electricity rates (naveen.ing)
Foundation Model for Personalized Recommendation (netflixtechblog.com)
Netflix’s personalized recommender system is a complex system, boasting a variety of specialized machine learned models each catering to distinct needs including “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent overview for more details). However, as we expanded our set of personalization algorithms to meet increasing business needs, maintenance of the recommender system became quite costly.
Go Optimization Guide (goperf.dev)
Tail Call Recursion in Java with ASM (2023) (unlinkedlist.org)
One kind of optimization offered by some compilers is tail call optimization. This optimization does not bring much, since the programmer can always rewrite the code without recursion, especially in an imperative language. On the other hand, recursive code is often more elegant, so why not let the compiler do the nasty stuff when it is possible? In this article I will present a neat way to implement tail call optimization in Java using bytecode manipulation with ASM.
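To make the idea concrete, here is a tail-recursive function next to the loop that tail call optimization would effectively turn it into. This is a hand-written illustration in Python, not the ASM bytecode rewriting the linked article describes; the function names are hypothetical.

```python
def sum_to_recursive(n, acc=0):
    """Sum 1..n with the recursive call in tail position."""
    if n == 0:
        return acc
    # Nothing remains to be done after this call returns, so it is a tail call.
    return sum_to_recursive(n - 1, acc + n)


def sum_to_loop(n, acc=0):
    """The same computation with the tail call replaced by a jump back to the top."""
    while n != 0:
        n, acc = n - 1, acc + n  # rebind the parameters instead of pushing a new frame
    return acc
```

CPython does not perform this transformation itself, so `sum_to_recursive` fails once `n` exceeds the recursion limit, while `sum_to_loop` runs in constant stack space.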