Things you can do to clean up a fresh install of Windows 11 24H2 and Edge
(arstechnica.com)
If you start using Windows 11 this year, you'll want to know how to clean it up.
A Clang regression related to switch statements and inlining
(nicula.xyz)
After my previous post, Eliminating redundant bound checks (read it for context if you haven’t already), I wanted to do a benchmark using the ‘optimized’ version of the increment() function, which didn’t contain any bound checks when compiled with Clang, even though we used .at() for indexing into the array.
Fast Video Generation with Sliding Tile Attention
(hao-ai-lab.github.io)
TL;DR: Video generation with DiTs is painfully slow – HunyuanVideo takes 16 minutes to generate just a 5-second video on an H100 with FlashAttention3. Our sliding tile attention (STA) slashes this to 5 minutes with zero quality loss, no extra training required. Specifically, STA accelerates attention alone by 2.8–17x over FlashAttention-2 and 1.6–10x over FlashAttention-3.
Tensor evolution: A framework for fast tensor computations using recurrences
(arxiv.org)
This paper introduces a new mathematical framework for analysis and optimization of tensor expressions within an enclosing loop.
Making my debug build run 100x faster so that it is finally usable
(gaultier.github.io)
SIMD and dedicated silicon to the rescue.
(Ab)using general search algorithms on dynamic optimization problems (2023)
(dubovik.eu)
In retrospect, my most ambitious blog yet. As it goes, I was reading “Artificial Intelligence. A Modern Approach” the other day. In one of the earlier chapters the authors discuss general search algorithms: breadth-first search, depth-first search, uniform-cost search (Dijkstra), and variations of those. A bit later they also cover Monte Carlo tree search as a way of finding approximate solutions in big state spaces.
JEP draft: AOT cache command-line ergonomics
(openjdk.org)
Make it easier to create an ahead-of-time cache (as defined by JEP 483) for a Java application, by simplifying the commands required by some common use cases.
AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors
(lemire.me)
The recent AMD processors (Zen 4) provide extensive support for the powerful AVX-512 instructions.
The largest sofa you can move around a corner
(quantamagazine.org)
A new proof reveals the answer to the decades-old “moving sofa” problem. It highlights how even the simplest optimization problems can have counterintuitive answers.
How do modern compilers choose which variables to put in registers?
(stackexchange.com)
How do modern compilers choose which variables to put in registers?
Rust: Doubling Throughput with Continuous Profiling and Optimization
(polarsignals.com)
“68.37% of CPU [...] with a one-line code change [...] went down to 31.82%”
Show HN: Dockershrink – AI Assistant to reduce the size of Docker images
(github.com/duaraghav8)
Dockershrink is an AI-powered Commandline Tool that helps you reduce the size of your Docker images
Explaining my fast 6502 code generator (2023)
(pubby.games)
To learn how optimizing compilers are made, I built one targeting the 6502 architecture. In a bizarre twist, my compiler generates faster code than GCC, LLVM, and every other compiler I compared it to.
Git clone --depth 2 is vastly better than --depth 1 if you want to Git push later
(stackoverflow.com)
I've done a shallow clone of a large repo (git clone --depth 1 [email protected]:myOrg/myRepo.git). I can push new changes to the remote but the first push is horribly slow. Subsequent pushes are fine. The command indicates that the first push writes a lot of data to the remote:
Implementing the President's "DOGE" Workforce Optimization Initiative
(whitehouse.gov)
To restore accountability to the American public, this order commences a critical transformation of the Federal bureaucracy.
Don't "optimize" conditional moves in shaders with mix()+step()
(iquilezles.org)
In this article I want to correct a popular misconception that's been making the rounds in computer graphics aficionado circles for a long time now.
JEP draft: 4-byte Object Headers (Experimental)
(openjdk.org)
Reduce the size of object headers in the HotSpot JVM from between 64 and 128 bits down to 32 bits on 64-bit architectures. This will reduce heap size, improve deployment density, and increase data locality.
We are destroying software
(antirez.com)
We are destroying software by no longer taking complexity into account when adding features or optimizing some dimension.
Explainable Linear Programs
(jeremykun.com)
Back in 2020, when I worked in the supply chain side of Google, I had a fun and impactful side project related to human-level explanations of linear programs.
Global Optimization of Black-Box Functions with Unknown Lipschitz Constants
(arxiv.org)
Optimizing expensive, non-convex, black-box Lipschitz continuous functions presents significant challenges, particularly when the Lipschitz constant of the underlying function is unknown.
From hours to 360ms: over-engineering a puzzle solution
(danielh.cc)
In January 2025, Jane Street posted an interesting puzzle:
The Longest Nvidia PTX Instruction
(ashvardanian.com)
The race for AI dominance isn’t just about who has the most computing - it’s increasingly about who can use it most efficiently.
The Slotted Counter Pattern (2020)
(planetscale.com)
It is a common database pattern to increment an INT column when an event happens, such as a download or page view.
Don't Animate Height
(granola.ai)
Our app was mysteriously using 60% CPU and 25% GPU on my M2 MacBook. It turned out this was due to a tiny CSS animation!
Optimizing with Novel Calendrical Algorithms
(jhpratt.dev)
After exercising some creativity and applying some mathematical tricks, I was able to achieve something that I am more than happy with.
Decorator JITs: Python as a DSL
(thegreenplace.net)
Spend enough time looking at Python programs and packages for machine learning, and you'll notice that the "JIT decorator" pattern is pretty popular.
Optimizing with Novel Calendrical Algorithms
(jhpratt.dev)
I was able to create performant, branchless, and const-compatible algorithms for converting an ordinal date to a calendar date that is 57.5% faster than the previous implementation, the month-only algorithm 43.2% faster, and the day-only algorithm 48.1% faster.
Fast Rust Builds
(matklad.github.io)
It’s common knowledge that Rust code is slow to compile. But I have a strong gut feeling that most Rust code out there compiles much slower than it could.
DeepSeek's multi-head latent attention and other KV cache tricks
(pyspur.dev)
Key-Value caching techniques are central to scaling and optimizing Transformer-based models for real-world use.