Foundation Model for Personalized Recommendation(netflixtechblog.com) Netflix’s personalized recommender system comprises a variety of specialized machine-learned models, each catering to a distinct need such as “Continue Watching” and “Today’s Top Picks for You.” (Refer to our recent overview for more details.) However, as we expanded our set of personalization algorithms to meet growing business needs, maintaining the recommender system became quite costly.
Tail Call Recursion in Java with ASM (2023)(unlinkedlist.org) One kind of optimization offered by some compilers is tail call optimization. On its own this optimization does not bring much, since the programmer can always rewrite the code without recursion, especially in an imperative language. On the other hand, recursive code is often more elegant, so why not let the compiler do the nasty work when it is possible? In this article I present a neat way to implement tail call optimization in Java using bytecode manipulation with ASM.
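The article's actual transform operates on JVM bytecode via ASM; as a rough sketch of what that transform achieves, here is the same rewrite at the source level in Python (function names are illustrative, not from the article):

```python
import sys
sys.setrecursionlimit(10_000)

def fact_rec(n, acc=1):
    # Tail-recursive: the recursive call is the last action, so its stack
    # frame could be reused, but neither CPython nor the JVM does this.
    if n <= 1:
        return acc
    return fact_rec(n - 1, acc * n)

def fact_loop(n, acc=1):
    # What tail call optimization produces: the call becomes a jump back to
    # the top of the function with the parameters rebound.
    while n > 1:
        n, acc = n - 1, acc * n
    return acc

assert fact_rec(500) == fact_loop(500)
```

The bytecode version does the same thing: it replaces the recursive invoke instruction with stores into the argument slots followed by a GOTO to the method entry.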
HTTP/2 zero latency write coalescing(nitely.github.io) Write coalescing is an I/O optimization technique where multiple small writes are merged into a single larger write before sending data to the underlying system. In HTTP/2, we can batch multiple frames from one or more streams and send them all at once. This reduces the number of syscalls, and avoids sending tiny TCP packets under load.
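A minimal Python sketch of the idea (not the post's implementation), assuming the HTTP/2 layer has queued a list of already-serialized frames: instead of one send() per frame, hand the kernel the whole batch in a single vectored write.

```python
import os
import socket

def flush_frames(sock: socket.socket, frames: list[bytes]) -> None:
    """Write all queued frames with as few syscalls as possible."""
    buffers = list(frames)  # copy so we can trim partially-written buffers
    fd = sock.fileno()
    while buffers:
        # os.writev (POSIX): one syscall for the whole batch, letting the
        # kernel emit full TCP segments instead of many tiny packets.
        written = os.writev(fd, buffers)
        # Drop buffers that were fully written; trim a partial one.
        while buffers and written >= len(buffers[0]):
            written -= len(buffers[0])
            buffers.pop(0)
        if buffers and written:
            buffers[0] = buffers[0][written:]
```

The "zero latency" part of the title suggests batching whatever is queued at flush time rather than waiting on a timer, so no artificial delay is added.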
Optimizing ML training with metagradient descent(arxiv.org) A major challenge in training large-scale machine learning models is configuring the training process to maximize model performance, i.e., finding the best training setup from a vast design space.
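A toy version of the core idea, not the paper's method: treat a training hyperparameter as differentiable, unroll a few inner optimization steps, and take the gradient of the final loss with respect to it. Sketched in JAX:

```python
import jax
import jax.numpy as jnp

X = jnp.array([[0.0], [1.0], [2.0], [3.0]])
y = jnp.array([1.0, 3.0, 5.0, 7.0])

def loss(w):
    return jnp.mean((X @ w - y) ** 2)

def train(lr, steps=20):
    # Inner training loop, unrolled so JAX can differentiate through it.
    w = jnp.zeros(1)
    for _ in range(steps):
        w = w - lr * jax.grad(loss)(w)
    return loss(w)

# The metagradient: how the post-training loss responds to the learning rate.
meta_grad = jax.grad(train)(0.05)
lr = 0.05 - 0.5 * meta_grad  # one step of metagradient descent
```

Scaling this beyond toy problems (e.g. differentiating through thousands of steps of large-model training) is exactly the challenge the paper addresses.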
DNS Speed Test(dnsspeedtest.online) Optimize your internet experience by finding the fastest DNS server for your location. Just click the button below to start the test.
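A rough command-line equivalent of such a test, assuming the third-party dnspython package (pip install dnspython, version 2+); the resolver IPs are the well-known public ones.

```python
import time
import dns.resolver

SERVERS = {"Cloudflare": "1.1.1.1", "Google": "8.8.8.8", "Quad9": "9.9.9.9"}

for name, ip in SERVERS.items():
    r = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
    r.nameservers = [ip]
    start = time.perf_counter()
    r.resolve("example.com", "A")
    print(f"{name:10s} {(time.perf_counter() - start) * 1000:6.1f} ms")
```

A serious test would average many queries over uncached names, since a single lookup mostly measures whether the resolver already has the answer cached.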
C++26 Expansion Tricks(pydong.org) P1306 gives us compile-time repetition of a statement for each element of a range. What if we instead want the elements as a pack, without introducing a new function scope?
Teardown, Optimization: Comsol 8Gb USB Flash Stick (2015)(goughlui.com) A while back, I received a Comsol 8Gb USB Flash Stick to test. As it turns out, I’ve managed to grab another, so I felt less bad about breaking one apart to work out what’s inside – and it provided me with a world of entertainment for the weekend. It was more than I expected, and the optimization process is something engineers (like myself) really get excited about.
Jagged Flash Attention Optimization(shaped.ai) Meta researchers have introduced Jagged Flash Attention, a novel technique that significantly enhances the performance and scalability of large-scale recommendation systems.
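For context on what "jagged" buys here (this is not Meta's implementation): recommendation batches contain user histories of wildly different lengths, and the usual dense approach pads everything to the longest sequence. A NumPy baseline showing the waste:

```python
import numpy as np

lengths = [2, 5]                       # two users' history lengths
B, T, d = len(lengths), max(lengths), 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((B, T, d)) for _ in range(3))

# Padded (dense) attention: much of the TxT score matrix is masked-out padding.
mask = np.arange(T)[None, :] < np.array(lengths)[:, None]   # (B, T)
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)              # (B, T, T)
scores = np.where(mask[:, None, :], scores, -np.inf)        # mask padded keys
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v
```

A jagged layout stores only the sum(lengths) real rows and skips the padded compute entirely, which is where the reported speedups come from.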
Specializing Python with E-Graphs(vectorfold.studio) We've explored progressively more sophisticated techniques for optimizing numerical computations. We started with basic MLIR concepts, moved through memory management and linear algebra, and then neural network implementations. Each layer has added new capabilities for expressing and optimizing computations. Now we're ready to build our first toy compiler for Python expressions.
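For readers new to the technique: an e-graph compactly stores many equivalent forms of an expression and picks the cheapest after rewriting saturates. A deliberately naive Python sketch of that idea (a plain set of terms rather than a real e-graph with union-find sharing, and the rewrite rules are hypothetical):

```python
# Terms are ('op', a, b) tuples, ints, or variable names.

def rewrites(t):
    if isinstance(t, tuple):
        op, a, b = t
        if op == "mul" and b == 2:
            yield ("shl", a, 1)               # x * 2  ->  x << 1
        if op == "add" and b == 0:
            yield a                           # x + 0  ->  x
        for a2 in rewrites(a):                # rewrite inside subterms
            yield (op, a2, b)
        for b2 in rewrites(b):
            yield (op, a, b2)

def saturate(term):
    seen, frontier = {term}, {term}
    while frontier:                           # grow until no new forms appear
        frontier = {t2 for t in frontier for t2 in rewrites(t)} - seen
        seen |= frontier
    return seen

COST = {"mul": 4, "shl": 1, "add": 1}

def cost(t):
    return 0 if not isinstance(t, tuple) else COST[t[0]] + cost(t[1]) + cost(t[2])

expr = ("add", ("mul", "x", 2), 0)
print(min(saturate(expr), key=cost))          # ('shl', 'x', 1)
```

Real e-graphs share equivalent subterms in equivalence classes, which keeps saturation tractable where this naive set of whole terms would explode.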
Parallel Histogram Computation with CUDA(khushi-411.github.io) The aim of this blog post is to introduce the parallel histogram pattern, in which any thread may update any output element, so threads must coordinate as they update the output values. We will first introduce atomic operations, which serialize the updates to each element, and then study an optimization technique: privatization. Let’s dig in!
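A sketch of both patterns using Numba's CUDA bindings rather than C++ CUDA: a baseline where every thread contends on the global histogram, then the privatized version that accumulates into a per-block shared-memory copy and merges it once at the end.

```python
import numpy as np
from numba import cuda, uint32

NBINS = 256

@cuda.jit
def hist_atomic(data, out):
    # Baseline: all threads hammer the same global counters.
    i = cuda.grid(1)
    if i < data.size:
        cuda.atomic.add(out, data[i], 1)

@cuda.jit
def hist_privatized(data, out):
    local = cuda.shared.array(NBINS, uint32)    # one private copy per block
    t = cuda.threadIdx.x
    for b in range(t, NBINS, cuda.blockDim.x):  # zero the private histogram
        local[b] = 0
    cuda.syncthreads()
    i = cuda.grid(1)
    if i < data.size:
        cuda.atomic.add(local, data[i], 1)      # cheap shared-memory atomics
    cuda.syncthreads()
    for b in range(t, NBINS, cuda.blockDim.x):  # merge once per block
        if local[b]:
            cuda.atomic.add(out, b, local[b])

data = np.random.randint(0, NBINS, 1 << 20).astype(np.uint32)
out = np.zeros(NBINS, np.uint32)
hist_privatized[(data.size + 255) // 256, 256](data, out)
assert out.sum() == data.size
```

Privatization wins because contention moves from slow, highly-shared global memory to fast shared memory that only one block's threads touch.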
How to Minify Godot's Build Size (93MB → 6.4MB EXE)(bearblog.dev) This is an article I've wanted to write for ages. Godot's default file size for web exports is quite massive, and there weren't many guides on how to reduce it aside from the official documentation, which gives general information without any numbers or details, leaving people to figure out how effective any of the solutions really are. It also doesn't mention some more advanced tricks, which I'll cover here.
Bypassing the Branch Predictor(nicula.xyz) A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.
Deriving Muon(jeremybernste.in) We recently proposed Muon: a new neural net optimizer. Muon has garnered attention for its excellent practical performance: it was used to set NanoGPT speed records, leading to interest from the big labs.
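Muon's distinctive step, sketched from the public description: apply SGD-momentum, then orthogonalize each weight-matrix update before stepping. Below is the simple cubic Newton-Schulz iteration for that orthogonalization; the actual optimizer uses a tuned quintic polynomial that converges in far fewer steps, so treat this as a simplified stand-in.

```python
import numpy as np

def orthogonalize(G, steps=30):
    # Frobenius normalization bounds every singular value by 1, which keeps
    # the iteration stable.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        # Each step pushes all singular values of X toward 1, leaving the
        # singular vectors untouched: the result is the orthogonal factor.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).standard_normal((64, 32))
U = orthogonalize(G)
print(np.allclose(U.T @ U, np.eye(32), atol=1e-3))  # columns ~ orthonormal
```

The derivation in the post explains why this is the right move: it is steepest descent under a spectral norm on the weights rather than the Euclidean norm implicit in plain SGD.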
Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”(openpipe.ai) In this post we’ll discuss how we used Group Relative Policy Optimization (GRPO) to surpass R1, o1, and o3-mini, and come within a couple percentage points of Sonnet 3.7 on a reasoning-heavy game called “Temporal Clue,” while being over 100x cheaper to run at inference time. We’ll include specific lessons learned about task design and hyperparameters we’ve found to work well. Finally, we share the training recipe we used to achieve these results, built on top of torchtune.
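The core of GRPO, as introduced in the DeepSeekMath paper it comes from: sample a group of completions per prompt, score them, and use each reward's z-score within its own group as the advantage, so no learned value network is needed. A sketch (shapes and names are illustrative, not OpenPipe's training code):

```python
import numpy as np

def group_relative_advantages(rewards):
    # rewards: (num_prompts, group_size), one row per prompt's sampled group.
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-6
    return (rewards - mean) / std

rewards = np.array([[1.0, 0.0, 0.0, 1.0],   # puzzle solved by 2 of 4 samples
                    [0.0, 0.0, 0.0, 1.0]])
adv = group_relative_advantages(rewards)
# Each completion's token log-probs are then weighted by its advantage inside
# a clipped PPO-style objective.
```

Verifiable puzzles like Temporal Clue suit this setup well, since the reward (did the deduction check out?) can be computed exactly per completion.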
Succinct data structures(startifact.com) A few months ago, searching for ideas on how to make some code faster, I found myself reading a bunch of computer science papers.
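The canonical example from that literature, for readers unfamiliar with the term: a bitvector supporting constant-time rank queries in n + o(n) bits. A Python sketch (plain lists for clarity; a real implementation packs bits into machine words and uses popcount inside a block):

```python
class RankBitVector:
    BLOCK = 64

    def __init__(self, bits):
        self.bits = bits
        # blocks[i] = number of set bits strictly before block i
        self.blocks = [0]
        for i in range(0, len(bits), self.BLOCK):
            self.blocks.append(self.blocks[-1] + sum(bits[i:i + self.BLOCK]))

    def rank1(self, pos):
        """Count of 1s in bits[0:pos]."""
        b = pos // self.BLOCK
        return self.blocks[b] + sum(self.bits[b * self.BLOCK:pos])

bv = RankBitVector([1, 0, 1, 1, 0] * 100)
assert bv.rank1(10) == 6
```

Rank and its sibling select are the building blocks for succinct trees, text indexes, and other structures that stay close to the information-theoretic minimum size while still answering queries fast.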