Hacker News with Generative AI: Optimization

The Fifth Kind of Optimisation (tratt.net)
A little while back I wrote about what I considered to be the four main kinds of optimisation:
Show HN: Terminal dashboard that throttles my PC during peak electricity rates (naveen.ing)
Foundation Model for Personalized Recommendation (netflixtechblog.com)
Netflix’s personalized recommender system is a complex one, boasting a variety of specialized machine-learned models, each catering to distinct needs such as “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent overview for more details.) However, as we expanded our set of personalization algorithms to meet growing business needs, maintaining the recommender system became quite costly.
Go Optimization Guide (goperf.dev)
Tail Call Recursion in Java with ASM (2023) (unlinkedlist.org)
One kind of optimization offered by some compilers is tail call optimization. On its own this optimization does not buy much, since the programmer can always rewrite the code without recursion, especially in an imperative language. On the other hand, recursive code is often more elegant, so why not let the compiler do the nasty stuff when it is possible? In this article I will present a neat way to implement tail call optimization in Java using bytecode manipulation with ASM.
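Bytecode rewriting with ASM is the article's approach; as a language-neutral sketch of the same idea (all names here are illustrative, not from the article), a trampoline turns tail calls into a loop so the call stack never grows:

```python
# A minimal trampoline sketch in Python, standing in for the JVM
# bytecode rewrite: tail calls return a thunk instead of recursing,
# and a driver loop runs thunks until a final value appears.

def trampoline(step, *args):
    """Run `step` repeatedly until it returns a value instead of a thunk."""
    result = step(*args)
    while callable(result):
        result = result()
    return result

def factorial(n, acc=1):
    if n <= 1:
        return acc
    # Return a zero-argument thunk instead of making the tail call directly.
    return lambda: factorial(n - 1, acc * n)

print(trampoline(factorial, 10))  # 3628800, with constant stack depth
```

Depths far beyond Python's recursion limit (e.g. `trampoline(factorial, 5000)`) run without a stack overflow, which is exactly what the bytecode-level transformation buys the JVM programmer.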
HTTP/2 zero latency write coalescing (nitely.github.io)
Write coalescing is an I/O optimization technique where multiple small writes are merged into a single larger write before sending data to the underlying system. In HTTP/2, we can batch multiple frames from one or more streams and send them all at once. This reduces the number of syscalls and avoids sending tiny TCP packets under load.
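The article implements this inside an HTTP/2 server; as a hedged toy sketch in Python (class and names mine, not the article's), the core mechanism is just "queue frames, flush once":

```python
# Toy write coalescing: frames queued during one event-loop tick are
# concatenated and handed to the OS in a single write, instead of
# paying one syscall per frame.

class CoalescingWriter:
    def __init__(self, sock):
        self.sock = sock        # anything with a sendall(bytes) method
        self.pending = []

    def write_frame(self, frame: bytes):
        self.pending.append(frame)   # no syscall here

    def flush(self):
        if self.pending:
            self.sock.sendall(b"".join(self.pending))  # one syscall
            self.pending.clear()

# Demo with a fake socket that records each sendall() call.
class FakeSocket:
    def __init__(self):
        self.calls = []
    def sendall(self, data):
        self.calls.append(data)

sock = FakeSocket()
writer = CoalescingWriter(sock)
for frame in (b"HEADERS", b"DATA1", b"DATA2"):
    writer.write_frame(frame)
writer.flush()
print(len(sock.calls))  # 1 write for 3 frames
```

The "zero latency" part of the article's title comes from flushing at the end of the current event-loop tick rather than waiting on a timer, so batching never delays a frame.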
FFN Fusion: Rethinking Sequential Computation in Large Language Models (arxiv.org)
We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization.
Optimizing ML training with metagradient descent (arxiv.org)
A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space.
DNS Speed Test (dnsspeedtest.online)
Optimize your internet experience by finding the fastest DNS server for your location. Just click the button below to start the test.
C++26 Expansion Tricks (pydong.org)
P1306 gives us compile time repetition of a statement for each element of a range - what if we instead want the elements as a pack without introducing a new function scope?
Quantitative Finance: Kronecker-Factored Approximate Curvature Deep Hedging (arxiv.org)
This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization.
Teardown, Optimization: Comsol 8Gb USB Flash Stick (2015) (goughlui.com)
A while back, I received a Comsol 8Gb USB Flash Stick for a test. As it turns out, I’ve managed to grab another, so I felt less bad about breaking one apart to work out what’s inside – and as it turns out, it provided me a world of entertainment for the weekend. It was more than I expected, and the optimization process is something engineers (like myself) really get excited about.
Activision Cut Call of Duty's Build Time by 50% (microsoft.com)
Slow build times are a major headache for developers, especially in large, complex C++ codebases like game engines.
Optimizing Brainfuck interpreter in the C preprocessor (github.com/camel-cdr)
A C99-conforming* optimizing Brainfuck implementation written (and executed) using only the C preprocessor.
Ask HN: Why some languages use 1 byte for boolean type (ycombinator.com)
Some programming languages, like D, use 8 bits for their boolean type. Why don't they use 1 bit?
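The short answer is that memory is byte-addressable, so a standalone boolean gets the smallest addressable unit: one byte. Packing is still possible by hand, as this hedged sketch shows (names mine), at the cost of extra shift/mask work on every access:

```python
# Eight flags can share one byte-sized integer, but each read/write
# then needs shift-and-mask arithmetic -- the trade-off that makes
# languages default to one whole byte per bool.

def pack(bits):
    """Pack a list of booleans into an int, least significant bit first."""
    value = 0
    for i, b in enumerate(bits):
        if b:
            value |= 1 << i
    return value

def get(value, i):
    """Read flag i back out of the packed integer."""
    return bool(value >> i & 1)

flags = pack([True, False, True, True])
print(flags)          # 0b1101 == 13
print(get(flags, 2))  # True
```

This is essentially what `std::vector<bool>` in C++ or a bitset type does internally, and why such packed containers can't hand out a plain reference to a single element.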
Jagged Flash Attention Optimization (shaped.ai)
Meta researchers have introduced Jagged Flash Attention, a novel technique that significantly enhances the performance and scalability of large-scale recommendation systems.
Specializing Python with E-Graphs (vectorfold.studio)
We've explored progressively more sophisticated techniques for optimizing numerical computations. We started with basic MLIR concepts, moved through memory management and linear algebra, and then on to neural network implementations. Each layer has added new capabilities for expressing and optimizing computations. Now we're ready to build our first toy compiler for Python expressions.
Quantum Speedup Found for Class of Hard Problems (quantamagazine.org)
It’s been difficult to find important questions that quantum computers can answer faster than classical machines, but a new algorithm appears to do it for some critical optimization tasks.
Speeding up C++ code with template lambdas (lemire.me)
Let us consider a simple C++ function which divides all values in a range of integers:
zlib-ng: zlib replacement with optimizations for "next generation" systems (github.com/zlib-ng)
zlib replacement with optimizations for "next generation" systems.
A 2FA app that tells you when you get `314159` (2024) (jacobstechtavern.com)
This was a pretty fun project: not only did I manage to tickle the part of my geek brain which loves spotting patterns; I got to handle some nifty processing, threading, and optimisation problems!
Parallel Histogram Computation with CUDA (khushi-411.github.io)
The aim of this blog post is to introduce the parallel histogram pattern, in which any thread may update any output element, so threads must coordinate as they update the output values. We will first introduce atomic operations, which serialize the updates to each element, and then study an optimization technique: privatization. Let’s dig in!
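The post works in CUDA; as a hedged illustration in plain Python threads instead (all names mine), privatization means each thread fills its own private histogram and touches shared state only once, at merge time, rather than serializing every single increment:

```python
# Privatization sketch: per-thread private histograms, merged into the
# shared result in one short critical section per thread, instead of
# locking (or atomically updating) the shared bins on every element.

import threading
from collections import Counter

def histogram(data, num_threads=4):
    shared = Counter()
    lock = threading.Lock()

    def worker(chunk):
        private = Counter(chunk)     # private bins: no contention here
        with lock:                   # one critical section per thread
            shared.update(private)   # merge once at the end

    step = (len(data) + num_threads - 1) // num_threads
    threads = [threading.Thread(target=worker,
                                args=(data[i * step:(i + 1) * step],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

print(histogram("abracadabra")["a"])  # 5
```

In CUDA the same idea uses per-block histograms in shared memory with atomics only on the final merge into global memory; the Python version just makes the structure of the optimization visible.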
How to Minify Godot's Build Size (93MB → 6.4MB EXE) (bearblog.dev)
This is an article I've wanted to write for ages. Godot's default file size for web exports is quite massive, and there weren't many guides on how to reduce it aside from the official documentation, which gives general information without numbers or details, leaving people to figure out how effective any of the solutions really are. It also doesn't mention some more advanced tricks, which I'll cover here.
One character change provides 20% savings for Meta (theregister.com)
Meta says it has managed to reduce the CPU cycles of its top services by 20 percent through its Strobelight profiling orchestration suite, which relies on the open source eBPF project.
Bypassing the Branch Predictor (nicula.xyz)
A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.
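The article's context is native code and real branch-predictor behavior; as a hedged toy in Python (names mine, and Python itself gains nothing from this), the classic escape hatch is to make the computation branch-free so there is no data-dependent branch left to mispredict:

```python
# Branchy vs. branch-free clamping. In compiled code the second form
# tends to lower to conditional moves / min-max instructions, removing
# the unpredictable branches that the first form takes on random input.

def clamp_branchy(x, lo, hi):
    if x < lo:          # data-dependent branch
        return lo
    if x > hi:          # another data-dependent branch
        return hi
    return x

def clamp_branchless(x, lo, hi):
    # min/max express the same selection without control flow
    return min(max(x, lo), hi)

print(clamp_branchless(12, 0, 9))  # 9
```

Whether this actually helps depends on the input distribution: a predictable branch is essentially free, so branchless rewrites pay off mainly when the data is adversarial or random, which is exactly the situation the article explores.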
Improving on std::count_if()'s auto-vectorization (nicula.xyz)
Deriving Muon (jeremybernste.in)
We recently proposed Muon: a new neural net optimizer. Muon has garnered attention for its excellent practical performance: it was used to set NanoGPT speed records, drawing interest from the big labs.
Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue” (openpipe.ai)
In this post we’ll discuss how we used Group Relative Policy Optimization (GRPO) to surpass R1, o1, and o3-mini, and come within a couple of percentage points of Sonnet 3.7, on a reasoning-heavy game called “Temporal Clue”, while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and hyperparameters we’ve found to work well. And finally, we share the training recipe we used to achieve these results, built on top of torchtune.
Succinct data structures (startifact.com)
A few months ago, searching for ideas on how to make some code faster, I found myself reading a bunch of computer science papers.