Hacker News with Generative AI: Optimization

Kangaroo: A flash cache optimized for tiny objects (2021) (engineering.fb.com)
Kangaroo is a new flash cache that enables more efficient caching of tiny objects (objects that are ~100 bytes or less) and overcomes the challenges presented by existing flash cache designs.
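Kangaroo's published design puts a small log-structured cache in front of a large set-associative cache, so that several tiny objects bound for the same set can be written together instead of rewriting a whole flash page for each object. The toy in-memory model below illustrates only that batching idea, under simplifying assumptions. `TinyLogCache` and its parameters are hypothetical, eviction is plain FIFO, and `set_rewrites` merely stands in for flash page writes. Kangaroo's admission policies and its real flash layout are not modeled.

```python
from collections import OrderedDict
from typing import Optional


class TinyLogCache:
    """Toy model: a small FIFO log in front of a set-associative store.

    Hypothetical simplification: when the log overflows, its oldest entry is
    moved out, together with every other logged entry that maps to the same
    set, in a single set rewrite.
    """

    def __init__(self, log_capacity: int, num_sets: int, set_capacity: int):
        self.log_capacity = log_capacity
        self.num_sets = num_sets
        self.set_capacity = set_capacity
        self.log: "OrderedDict[str, bytes]" = OrderedDict()  # insertion order = FIFO
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.set_rewrites = 0   # proxy for flash page writes
        self.objects_moved = 0  # objects migrated from the log into sets

    def _set_index(self, key: str) -> int:
        return hash(key) % self.num_sets

    def get(self, key: str) -> Optional[bytes]:
        if key in self.log:
            return self.log[key]
        return self.sets[self._set_index(key)].get(key)

    def put(self, key: str, value: bytes) -> None:
        self.log.pop(key, None)  # re-inserting moves the key to the log's tail
        self.log[key] = value
        if len(self.log) > self.log_capacity:
            self._flush_oldest()

    def _flush_oldest(self) -> None:
        idx = self._set_index(next(iter(self.log)))
        # The batch: every logged object that belongs to the same set.
        batch = [k for k in self.log if self._set_index(k) == idx]
        target = self.sets[idx]
        for k in batch:
            target.pop(k, None)
            target[k] = self.log.pop(k)
        while len(target) > self.set_capacity:  # FIFO eviction inside the set
            target.popitem(last=False)
        self.set_rewrites += 1  # one rewrite amortized over len(batch) objects
        self.objects_moved += len(batch)
```

Each set rewrite carries as many objects as happen to be waiting in the log for that set, which is where the write savings come from in this model.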
Loading Pydantic models from JSON without running out of memory (pythonspeed.com)
You have a large JSON file, and you want to load the data into Pydantic. Unfortunately, this uses a lot of memory, to the point where large JSON files are very difficult to read. What to do?
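A common way to keep memory flat is to stream the top-level array and validate one record at a time, instead of holding the whole document in memory at once. The sketch below uses only the standard library: `json.JSONDecoder.raw_decode` pulls one element at a time out of a buffered text stream, and a plain `dataclass` named `Customer` stands in for a Pydantic model so the example runs without Pydantic installed. `iter_json_array`, `Customer` and `chunk_size` are hypothetical, the article's own approach may differ, and this parser is lenient about stray commas.

```python
import io
import json
from dataclasses import dataclass
from typing import Iterator, TextIO

WHITESPACE = " \t\r\n"


@dataclass
class Customer:
    # Hypothetical record type standing in for a Pydantic model.
    id: int
    name: str


def iter_json_array(stream: TextIO, chunk_size: int = 65536) -> Iterator[object]:
    """Yield the elements of a top-level JSON array one at a time.

    Memory stays bounded by roughly one chunk plus one element.
    """
    decoder = json.JSONDecoder()
    buf = ""
    eof = False

    def fill() -> None:
        nonlocal buf, eof
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True

    def skip(chars: str) -> None:
        # Drop leading `chars`, reading more input while the buffer is empty.
        nonlocal buf
        while True:
            buf = buf.lstrip(chars)
            if buf or eof:
                return
            fill()

    skip(WHITESPACE)
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        skip(WHITESPACE + ",")
        if not buf:
            raise ValueError("unterminated JSON array")
        if buf[0] == "]":
            return
        try:
            value, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            if eof:
                raise
            fill()  # element is split across chunks: read more and retry
            continue
        rest = buf[end:].lstrip(WHITESPACE)
        if not rest or rest[0] not in ",]":
            # Not provably complete yet ("12" may continue as "123", "1." as "1.5").
            if eof:
                raise ValueError("malformed or truncated JSON array")
            fill()
            continue
        buf = rest
        yield value


# Usage: validate records one by one instead of building one huge list first.
raw = json.dumps([{"id": i, "name": f"c{i}"} for i in range(3)])
customers = [Customer(**obj) for obj in iter_json_array(io.StringIO(raw), chunk_size=8)]
```

In a real program the list comprehension would usually be replaced by a loop that processes and discards each record, which is what keeps peak memory low.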
Fast Allocations in Ruby 3.5 (railsatscale.com)
Many Ruby applications allocate objects. What if we could make allocating objects six times faster? We can! Read on to learn more!
Improving performance of rav1d video decoder (ohadravid.github.io)
Making the rav1d Video Decoder 1% Faster
Too Much Go Misdirection (tedunangst.com)
Poking through layers of indirection in Go, trying to recover some efficiency.
Layers All the Way Down: The Untold Story of Shader Compilation (moonside.games)
For a game developer who works primarily in frameworks instead of engines, one of the biggest pain points is the need to render efficiently on multiple platforms.
Backtrace is finally cheap by abusing x86/Linux's shadow stack (intmainreturn0.com)
A backtrace is a very helpful debugging tool in native programming: it gives the source location at each level of the call stack. Unfortunately, getting a backtrace is expensive.
Show HN: KVSplit – Run 2-3x longer contexts on Apple Silicon (github.com/dipampaul17)
Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism's KV cache. KVSplit enables you to:
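To make "different precision for keys and values" concrete, the sketch below applies simple symmetric per-tensor quantization and compares storage for an illustrative split of 8-bit keys and 4-bit values against a 16-bit baseline. `quantize`, `dequantize` and `kv_cache_bytes` are hypothetical helpers, not KVSplit's code, and the assumption that keys need more bits than values is an illustrative choice. Real KV-cache quantizers typically use per-group scales, which this ignores.

```python
from typing import List, Tuple


def quantize(xs: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(x) for x in xs) / qmax or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


def kv_cache_bytes(tokens: int, head_dim: int, key_bits: int, value_bits: int) -> float:
    """Storage for one head's keys plus values, ignoring the per-tensor scales."""
    return tokens * head_dim * (key_bits + value_bits) / 8


# Illustrative split (assumed): 8-bit keys, 4-bit values vs. 16-bit for both.
baseline = kv_cache_bytes(4096, 128, 16, 16)
split = kv_cache_bytes(4096, 128, 8, 4)
```

With these bit widths the cache shrinks by a factor of 32/12, about 2.7, which is the kind of headroom that lets a longer context fit in the same memory. The rounding error per element is at most half of `scale`.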
New Life Hack: Using LLMs and Constraint Solvers for Personal Logistics Tasks (emschwartz.me)
I enjoy doing escape rooms and was planning to do a couple of them with a group of friends this weekend. The very minor and not-very-important challenge, however, was that I couldn't figure out how to assign friends to rooms. I want to do at least one room with each person, different people are arriving and leaving at different times, and there are only so many time slots.
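The room-assignment puzzle can be stated as a tiny constraint problem. In the sketch below, exhaustive search stands in for a real constraint solver, and the friends, availability windows, number of slots and room capacity are all hypothetical. It encodes only two constraints: each friend plays at least one room with the organizer (here exactly one), and no room exceeds its capacity.

```python
from itertools import product
from typing import Dict, List, Optional

# Hypothetical instance: the slots (0, 1, 2) in which each friend is around.
AVAILABILITY: Dict[str, List[int]] = {
    "Ana": [0, 1],
    "Ben": [1, 2],
    "Cy": [0],
    "Dee": [2],
    "Eli": [0, 1, 2],
}
NUM_SLOTS = 3  # one room per slot, all attended by the organizer
CAPACITY = 2   # assumed number of friends a room takes besides the organizer


def assign(availability: Dict[str, List[int]], num_slots: int,
           capacity: int) -> Optional[Dict[str, int]]:
    """Give each friend one room (slot) they can attend, without overfilling any room.

    Exhaustive search; a real constraint solver would prune instead of enumerating.
    """
    friends = list(availability)
    for choice in product(*(availability[f] for f in friends)):
        counts = [0] * num_slots
        for slot in choice:
            counts[slot] += 1
        if max(counts) <= capacity:
            return dict(zip(friends, choice))
    return None  # no assignment satisfies every constraint


plan = assign(AVAILABILITY, NUM_SLOTS, CAPACITY)
```

For this instance the first feasible assignment found puts Ana and Cy in room 0, Ben and Eli in room 1, and Dee in room 2. Enumeration grows exponentially with the number of friends, which is why a proper solver is the better tool for anything larger.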
X X^t can be faster (arxiv.org)
We present a new algorithm, RXTX, that computes the product of a matrix by its transpose, XX^t. RXTX uses 5% fewer multiplications and additions than the state of the art and achieves speedups even for small matrices X. The algorithm was discovered by combining machine-learning-based search methods with combinatorial optimization.
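RXTX itself is not reproduced here. As a point of reference, the sketch below shows the baseline observation that any XX^t routine can exploit: the result is symmetric, so only the upper triangle needs dot products, about n(n+1)/2 of them instead of n^2 for an n-row X. The function name `x_xt` is hypothetical, and the multiplication counter exists only to make the saving visible.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def x_xt(x: Matrix) -> Tuple[Matrix, int]:
    """Compute X X^t for an n-by-m X, filling the lower triangle by symmetry.

    Returns the n-by-n product and the number of scalar multiplications used.
    """
    n, m = len(x), len(x[0])
    out = [[0.0] * n for _ in range(n)]
    mults = 0
    for i in range(n):
        for j in range(i, n):  # upper triangle (including the diagonal) only
            s = sum(a * b for a, b in zip(x[i], x[j]))  # row i dot row j
            mults += m
            out[i][j] = out[j][i] = s  # (X X^t)_ij == (X X^t)_ji
    return out, mults
```

For a 3-by-2 matrix this performs 12 multiplications instead of 18; algorithms like RXTX go further by restructuring the block-level products.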
Solving the local optima problem – NQueens (github.com/Dpbm)
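One standard way to deal with local optima in N-Queens, not necessarily the one used in the linked repository, is min-conflicts local search with random restarts: repeatedly move a conflicted queen to the least-attacked square in its row, and start over from a fresh random board when progress stalls. The function names, step limits and seed below are hypothetical.

```python
import random
from typing import List, Optional


def conflicts(cols: List[int], row: int, col: int) -> int:
    """Number of queens attacking square (row, col), ignoring the queen in `row`."""
    return sum(
        1
        for r, c in enumerate(cols)
        if r != row and (c == col or abs(c - col) == abs(r - row))
    )


def min_conflicts(n: int, max_steps: int = 10_000, restarts: int = 50,
                  seed: int = 0) -> Optional[List[int]]:
    """Return cols with cols[r] = column of the queen in row r, or None."""
    rng = random.Random(seed)
    for _ in range(restarts):
        cols = [rng.randrange(n) for _ in range(n)]  # one queen per row
        for _ in range(max_steps):
            bad = [r for r in range(n) if conflicts(cols, r, cols[r])]
            if not bad:
                return cols
            row = rng.choice(bad)
            scores = [conflicts(cols, row, c) for c in range(n)]
            best = min(scores)
            # Random tie-breaking lets the search drift across plateaus.
            cols[row] = rng.choice([c for c, s in enumerate(scores) if s == best])
        # Step budget exhausted (likely a local optimum): restart from scratch.
    return None
```

The restart loop is what addresses local optima: a run that keeps cycling among equally bad boards is abandoned rather than refined forever.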
A leap year check in three instructions (hueffner.de)
With the following code, we can check whether a year 0 ≤ y ≤ 102499 is a leap year with only about 3 CPU instructions:
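For reference, here is the textbook Gregorian rule next to a branch-free multiply, mask and compare check of the kind the title describes. The constants in `is_leap_fast` are our assumption rather than a quotation of the post; the test below checks them against the textbook rule over exactly 0 ≤ y ≤ 102499. Python integers do not overflow, so the 32-bit wraparound of a C `uint32_t` multiply is emulated with a mask.

```python
MASK32 = 0xFFFFFFFF


def is_leap(y: int) -> bool:
    """Textbook Gregorian rule."""
    return y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)


def is_leap_fast(y: int) -> bool:
    """Multiply, mask, compare; only meant for 0 <= y <= 102499.

    `& MASK32` reproduces the modulo-2**32 wraparound of a uint32_t product.
    """
    return ((y * 1073750999) & MASK32 & 3221352463) <= 126976
```

In C the first mask disappears, leaving one multiplication, one AND and one comparison.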
Determinate Nix 3.5: introducing lazy trees (determinate.systems)
Lazy trees have been one of the most hotly requested Nix features for quite some time. They make Nix much more efficient in larger repositories, particularly in massive monorepos. And so we’re excited to announce that lazy trees have landed in Determinate Nix version 3.5.2, based on version 2.28.3 of upstream Nix.
The smallest possible Docker image (github.com/MarkMcCulloh)
This is (hopefully) the smallest possible docker image that can be successfully executed.
Backslash: Rate Constrained Optimized Training of Large Language Models (arxiv.org)
The rapid advancement of large language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored.
JEP 515: Ahead-of-Time Method Profiling (openjdk.org)
Improve warmup time by making method-execution profiles from a previous run of an application instantly available, when the HotSpot Java Virtual Machine starts. This will enable the JIT compiler to generate native code immediately upon application startup, rather than having to wait for profiles to be collected.
15 Years of Shader Minification (ctrl-alt-test.fr)
How do demosceners create complex computer animations in just a few kilobytes? One of our secret weapons is Shader Minifier, a tool that minifies GLSL code. Over the years, it has evolved to pack more data into tiny executables, pushing the boundaries of what’s possible. In this blog post, we’ll go through its evolution.
A whippet waypoint / Nofl: A Precise Immix (wingolog.org)
Hey peoples! Tonight, some meta-words. As you know I am fascinated by compilers and language implementations, and I just want to know all the things and implement all the fun stuff: intermediate representations, flow-sensitive source-to-source optimization passes, register allocation, instruction selection, garbage collection, all of that.
Engineering Design Optimization Textbook (mdobook.github.io)
A graduate-level textbook covering a range of fundamental to advanced optimization theory and algorithms with practical tips, numerous illustrations, and engineering examples.
Optimizing an HTML5 game engine using composition over inheritance (radicalfishgames.com)
We started with HTML5 game development around the end of 2011. We bought an impact.js license and started working on CrossCode. Since CrossCode demanded 3D collision, we modified the engine, and kept doing so until almost every nook and cranny had been changed in one way or another. So it's safe to say that we developed not only a game but a whole game engine along with it.
Blazeio.SharpEvent: A Python Async Primitive That Scales to 1M Waiters with O(1) (ycombinator.com)
I’ve been working on a Python async library ([Blazeio](https://github.com/anonyxbiz/Blazeio)) and stumbled into a shockingly simple optimization that makes `asyncio.Event` look like a relic.
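The post's `SharpEvent` code is not reproduced here. As a hedged sketch of the general idea the title points to, the hypothetical `SharedFutureEvent` below has every waiter await one shared `asyncio.Future`, so `set()` resolves a single future instead of looping over a list of waiter futures as `asyncio.Event.set()` does. The event loop still schedules one wakeup per waiter, so the total wakeup work remains proportional to the number of waiters.

```python
import asyncio
from typing import Optional


class SharedFutureEvent:
    """Event whose waiters all await one shared future (hypothetical sketch)."""

    def __init__(self) -> None:
        self._fut: Optional[asyncio.Future] = None
        self._is_set = False

    def is_set(self) -> bool:
        return self._is_set

    async def wait(self) -> bool:
        if self._is_set:
            return True
        if self._fut is None:
            self._fut = asyncio.get_running_loop().create_future()
        # shield(): cancelling one waiter must not cancel the future shared by all.
        await asyncio.shield(self._fut)
        return True

    def set(self) -> None:
        if self._is_set:
            return
        self._is_set = True
        if self._fut is not None and not self._fut.done():
            self._fut.set_result(True)  # a single call releases every waiter

    def clear(self) -> None:
        if self._is_set:
            # Waiters on the old future were already released; later ones get a new one.
            self._is_set = False
            self._fut = None
```

Only resetting the future when the event is set matters: replacing a pending future would strand the tasks already waiting on it.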
Linear Programming for Fun and Profit: Finding Arbitrages in the GPU Market (modal.com)
If you haven’t noticed, the GPU market is highly volatile. NVIDIA repeatedly spews out new chip architectures, doubling FLOPS every few years. Everyone shifts towards the newest cards, causing temporary supply crunches and high prices. But Modal’s customers don’t want to think about these price fluctuations. They want GPUs of all kinds at predictable and good prices, and the ability to demand thousands of GPUs on a moment’s notice, without having to worry about pricing, capacity planning, or supply.
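The post frames GPU procurement as a linear program. A much smaller LP in the same spirit is to minimise hourly cost subject to a compute demand, with a cap on how many GPUs each offer provides. With a single covering constraint and box bounds, the LP optimum is given by filling the cheapest offers per unit of compute first (the fractional-knapsack argument), so no solver is needed. `Offer`, its fields and every number below are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Offer:
    name: str              # hypothetical provider/GPU pairing
    price_per_hour: float  # $ per GPU-hour
    tflops: float          # assumed throughput per GPU
    available: int         # GPUs on offer


def cheapest_mix(offers: List[Offer], demand_tflops: float) -> Dict[str, float]:
    """Solve  min sum(p_i x_i)  s.t.  sum(f_i x_i) >= D,  0 <= x_i <= u_i.

    Fractional GPU counts are allowed (LP relaxation). With one covering
    constraint, taking offers in order of $/TFLOP is optimal.
    """
    plan: Dict[str, float] = {}
    remaining = demand_tflops
    for o in sorted(offers, key=lambda o: o.price_per_hour / o.tflops):
        if remaining <= 0:
            break
        take = min(o.available, remaining / o.tflops)
        plan[o.name] = take
        remaining -= take * o.tflops
    if remaining > 1e-9:
        raise ValueError("demand exceeds total capacity")
    return plan
```

Real procurement adds integrality, multiple regions and time periods, and that is where a general LP or MIP solver becomes necessary.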
Optimizing Common Lisp (fosskers.ca)
I recently released a Parser Combinator library for Common Lisp, but was unhappy with its performance. This article describes how I used sb-sprof, built into SBCL, to identify both CPU and memory allocation hotspots, making the parcom/json module 3x faster and cutting its memory allocation by 25x.
Faster sorting with SIMD CUDA intrinsics (2024) (winwang.blog)
Recently, I finished a batch at the Recurse Center… is what I would have said if this post were written when I intended to write it (i.e. 3 months ago). My project there focused on a questionable application of CUDA (mostly irrelevant to this post), but it got me thinking more about other GPU-friendly algorithms.
Load-Store Conflicts (zeux.io)
meshoptimizer implements several geometry compression algorithms that are designed to take advantage of redundancies common in mesh data and decompress quickly - targeting many gigabytes per second in decoding throughput.
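A generic illustration of "taking advantage of redundancies common in mesh data" is to delta-encode neighbouring values, zigzag-map the signed deltas to small unsigned numbers, and write them as variable-length bytes. This is not meshoptimizer's codec, which uses its own SIMD-friendly formats. `zigzag`, `encode` and `decode` are hypothetical names for this sketch.

```python
from typing import List


def zigzag(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode(values: List[int]) -> bytes:
    out = bytearray()
    prev = 0
    for v in values:
        z = zigzag(v - prev)  # nearby values give small deltas, hence few bytes
        prev = v
        while True:  # LEB128-style varint: 7 payload bits per byte
            byte = z & 0x7F
            z >>= 7
            if z:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)


def decode(data: bytes) -> List[int]:
    values: List[int] = []
    prev, z, shift = 0, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:  # last byte of this varint
            prev += unzigzag(z)
            values.append(prev)
            z, shift = 0, 0
    return values
```

An index buffer whose consecutive entries differ by small amounts then costs about one byte per index instead of four; byte-at-a-time decoding like this is exactly what fast decoders avoid, which is why production formats look different.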
Minecraft runs on 8MB of VRAM using a 20-year-old GPU (tomshardware.com)
Fast(er) regular expression engines in Ruby (serpapi.com)
Performance-oriented comparison of alternative regexp engines that may (or may not) speed up your Ruby code.
An illustrated guide to automatic sparse differentiation (iclr-blogposts.github.io)
In numerous applications of machine learning, Hessians and Jacobians exhibit sparsity, a property that can be leveraged to vastly accelerate their computation.
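The core trick is that Jacobian columns which never share a nonzero row can be probed together, so one directional derivative recovers several columns at once. The sketch below colors columns greedily from a known sparsity pattern and then recovers the Jacobian with one forward difference per color. Forward differences are swapped in for automatic differentiation only so the example runs on the standard library, and the function names are hypothetical.

```python
from typing import Callable, List, Set

Vector = List[float]


def greedy_column_coloring(pattern: List[Set[int]], n_cols: int) -> List[int]:
    """pattern[i] is the set of columns that are nonzero in row i.

    Columns sharing a row conflict and must receive different colors.
    """
    neighbors: List[Set[int]] = [set() for _ in range(n_cols)]
    for row in pattern:
        for a in row:
            neighbors[a] |= row - {a}
    colors = [-1] * n_cols
    for j in range(n_cols):
        used = {colors[k] for k in neighbors[j] if colors[k] >= 0}
        c = 0
        while c in used:  # smallest color not used by a conflicting column
            c += 1
        colors[j] = c
    return colors


def sparse_jacobian_fd(f: Callable[[Vector], Vector], x: Vector,
                       pattern: List[Set[int]], colors: List[int],
                       h: float = 1e-6) -> List[List[float]]:
    """Recover J with one forward difference per color instead of per column."""
    fx = f(x)
    jac = [[0.0] * len(x) for _ in range(len(fx))]
    for color in range(max(colors) + 1):
        # Perturb every column of this color at once.
        xp = [xi + (h if colors[j] == color else 0.0) for j, xi in enumerate(x)]
        df = [(a - b) / h for a, b in zip(f(xp), fx)]
        for i, row in enumerate(pattern):
            for j in row:
                if colors[j] == color:
                    jac[i][j] = df[i]  # the only column of this color in row i
    return jac
```

For a tridiagonal Jacobian the greedy coloring needs three colors regardless of size, so three function evaluations replace n of them.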
Finding paths of least action with gradient descent (2023) (greydanus.github.io)
The purpose of this simple post is to bring to attention a view of physics which isn’t often communicated in introductory courses: the view of physics as optimization.
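A minimal version of "physics as optimization": discretize a path x_0, …, x_N with fixed endpoints, define the action S = Σ_k [ (m/2)((x_{k+1} − x_k)/Δt)^2 − V(x_k) ] Δt, and run gradient descent on the interior points. The sketch assumes uniform gravity, V(x) = m g x, for which S is a convex quadratic, so descent converges to the unique minimum; setting the gradient to zero gives m (x_{k+1} − 2x_k + x_{k−1})/Δt^2 = −m g, the discrete form of F = ma. The function name and all parameter values are hypothetical.

```python
from typing import List


def least_action_path(x_start: float, x_end: float, T: float = 1.0, n: int = 20,
                      m: float = 1.0, g: float = 9.8, steps: int = 5000) -> List[float]:
    """Gradient descent on the discretized action for V(x) = m*g*x.

    S = sum_k [ m/2 * ((x[k+1] - x[k]) / dt)**2 - m*g*x[k] ] * dt, endpoints fixed.
    """
    dt = T / n
    # Initial guess: straight line between the fixed endpoints.
    x = [x_start + (x_end - x_start) * k / n for k in range(n + 1)]
    lr = 0.4 * dt / m  # below 2/L, where L < 4m/dt bounds the Hessian's eigenvalues
    for _ in range(steps):
        grad = [0.0] * (n + 1)  # endpoints keep zero gradient, so they stay fixed
        for k in range(1, n):
            grad[k] = m * (2 * x[k] - x[k - 1] - x[k + 1]) / dt - m * g * dt
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x
```

With both endpoints at height 0 and T = 1, the path converges to the familiar parabola x(t) = 4.9t − 4.9t² of a ball thrown up and caught one second later; on a uniform grid the discrete optimum matches it exactly at the sample points.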