Hacker News with Generative AI: Benchmarking

The Two Word Test as a semantic benchmark for large language models (nature.com)
Large language models (LLMs) have shown remarkable abilities recently, including passing advanced professional exams and demanding benchmark tests.
Understanding JVM Garbage Collector Performance (mill-build.org)
Garbage collectors are a core part of many programming languages. While they generally work well, when they do go wrong they can fail in very unintuitive ways. This article discusses the fundamental design of garbage collectors and ties it to real benchmarks of how GCs perform on the Java Virtual Machine.
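The article's measurements run on the JVM; as a language-agnostic illustration of the approach, here is a minimal Python sketch (workload shape and sizes invented for illustration) that uses gc.callbacks to record collector pauses while allocation pressure and a growing live set stress the collector:

```python
import gc
import time

pauses = []
_start = [0.0]

def on_gc(phase, info):
    # gc.callbacks invokes this before ("start") and after ("stop") each collection
    if phase == "start":
        _start[0] = time.perf_counter()
    else:
        pauses.append(time.perf_counter() - _start[0])

gc.callbacks.append(on_gc)

live = []  # long-lived objects the collector must keep tracing
for i in range(200_000):
    garbage = [object() for _ in range(50)]  # short-lived allocation pressure
    if i % 10 == 0:
        live.append(garbage)  # retain some, so the heap keeps growing

print(f"collections: {len(pauses)}, worst pause: {max(pauses) * 1e3:.2f} ms")
```

The same shape of experiment, run against the JVM's collectors with realistic heap sizes, is the kind of measurement behind the article's benchmarks.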
SOTA on swebench-verified: relearning the bitter lesson (aide.dev)
SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically.
Notes on the New Deepseek v3 (composio.dev)
Deepseek released their flagship model, v3, a 671B mixture-of-experts model with 37B active parameters. It is currently the best open-source model, beating Llama 3.1 405B, Qwen, and Mistral. On benchmarks it is on par with OpenAI's GPT-4o and Claude 3.5 Sonnet, and at some tasks it outperforms the big closed models.
30% drop in o1-preview accuracy when Putnam problems are slightly varied (openreview.net)
As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated.
Benchmarking RSA Key Generation (filippo.io)
RSA key generation is both conceptually simple and one of the worst implementation tasks in the field of cryptography engineering. Even benchmarking it is tricky, and involves some math: here’s how we generated a stable but representative “average case” instead of using the ordinary statistical approach.
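To see why it is tricky, here is a naive benchmark (a sketch assuming the third-party Python cryptography package, not the post's code): key generation is a randomized prime search, so individual timings are draws from a long-tailed distribution, and the spread this prints is exactly what makes the ordinary statistical approach unsatisfying.

```python
import statistics
import time

from cryptography.hazmat.primitives.asymmetric import rsa  # third-party package

samples = []
for _ in range(20):
    t0 = time.perf_counter()
    rsa.generate_private_key(public_exponent=65537, key_size=2048)
    samples.append(time.perf_counter() - t0)

print(f"min={min(samples):.3f}s  median={statistics.median(samples):.3f}s  "
      f"max={max(samples):.3f}s  stdev={statistics.stdev(samples):.3f}s")
```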
Show HN: Made a small JavaScript benchmarking app – BenchJS (benchjs.com)
Intel's Linux Performance Optimizations Continue Paying Off for AMD EPYC (phoronix.com)
As part of my end-of-year benchmarking and various historical comparisons, over the holidays I was curious to take a look at how the mature AMD EPYC 9004 "Genoa" performance has evolved over the past two years under Linux.
Reflecting on o3 "beating ARC": are we reliving the ImageNet 2012 moment again? (ycombinator.com)
AlexNet came and blew everything out of the water. Then you can reflect on how much progress there has been from 2012 until now, just on that one little dataset.

o3 beating ARC, a much harder dataset, isn't something I even want to compare. So how much progress will there be from just this?

The next 10 years are gonna be bonkers.
The Performance Benefits of Linux 6.12 LTS over Linux 6.6 LTS (phoronix.com)
Linux 6.12 was recently promoted to this year's Long Term Support (LTS) kernel, being the last major kernel release of 2024. For enterprise Linux users, hyperscalers, and others who typically jump from one annual LTS kernel to the next, this holiday article offers benchmarks looking at the performance benefits of Linux 6.12 LTS over Linux 6.6 LTS on an AMD Ryzen Threadripper workstation.
The MTEB benchmark is dead (twitter.com)
JavaScript Benchmarking Is a Mess (byteofdev.com)
I hate benchmarking code, just like any human (which, at this point, most viewers of this probably aren’t ¯\_(ツ)_/¯). It is much more fun to pretend that your caching of a value increased performance 1000% rather than testing to see what it did. Alas, benchmarking in JavaScript is still necessary, especially as JavaScript is used (when it shouldn’t be?) in more performance-sensitive applications. Unfortunately, due to many of its core architectural decisions, JavaScript doesn’t make benchmarking any easier.
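None of the pitfalls are unique to JavaScript, and the fix is the same everywhere: measure instead of pretending, repeat the measurement, and report a distribution rather than a single number. A minimal Python illustration of that discipline (the fib example is invented, not from the article):

```python
import statistics
import timeit
from functools import lru_cache

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

for fn in (fib, fib_cached):
    # repeat() yields several independent timings; report the best and the spread
    runs = timeit.repeat(lambda: fn(20), number=1_000, repeat=5)
    print(f"{fn.__name__}: best={min(runs):.4f}s  stdev={statistics.stdev(runs):.4f}s")
```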
CUDA Moat Still Alive (semianalysis.com)
SemiAnalysis has been on a five-month-long quest to settle the reality of MI300X. In theory, the MI300X should have a huge advantage over Nvidia’s H100 and H200 in terms of specifications and Total Cost of Ownership (TCO). In reality, however, the on-paper specs are not representative of the performance that can be expected in a real-world environment.
H-Matched: A website tracking the shrinking gap between AI and human performance (vercel.app)
OpenAI O3 breakthrough high score on ARC-AGI-PUB (arcprize.org)
OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.
Optimizing Ruby's JSON, Part 1 (byroot.github.io)
I was recently made maintainer of the json gem, and aside from fixing some old bugs, I focused quite a bit on its performance, so that it is now the fastest JSON parser and generator for Ruby on most benchmarks.
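The post's harness is Ruby; the shape of such a benchmark, shown here with Python's json module purely as a stand-in, is to time many parse and generate iterations over a fixed document and report throughput:

```python
import json
import time

doc = {"users": [{"id": i, "name": f"user{i}", "active": i % 2 == 0}
                 for i in range(1_000)]}
blob = json.dumps(doc)

for name, fn, arg in (("generate", json.dumps, doc), ("parse", json.loads, blob)):
    n = 2_000
    t0 = time.perf_counter()
    for _ in range(n):
        fn(arg)
    dt = time.perf_counter() - t0
    print(f"{name}: {n / dt:,.0f} ops/s ({len(blob) * n / dt / 1e6:.1f} MB/s)")
```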
Gemini Flash 2.0 Experimental: A bit more accurate, but slower (ycombinator.com)
I just updated my data extraction leaderboard with Gemini Flash 2.0 Experimental. It is quite a bit slower with large input token sizes right now, but a bit more accurate than Gemini 1.5 Flash which is at the top right now.
Exploring inference memory saturation effect: H100 vs. MI300x (dstack.ai)
GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory saturation’s impact on inference using NVIDIA's H100 and AMD's MI300x with Llama 3.1 405B FP8.
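A back-of-envelope calculation shows why memory saturates: the FP8 weights alone are roughly 405 GB, and every cached token costs KV-cache space on top of that. The architecture numbers below are the published Llama 3.1 405B configuration (126 layers, 8 KV heads, head dimension 128); treat them, and the ideal-packing arithmetic, as assumptions rather than figures from the benchmark.

```python
PARAMS = 405e9
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128  # published Llama 3.1 405B config
FP8 = 1  # bytes per value

weights_gb = PARAMS * FP8 / 1e9
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP8  # K and V, per token
print(f"weights: {weights_gb:.0f} GB, KV cache: {kv_per_token / 1024:.0f} KiB/token")

for name, gpu_gb, n in (("8x H100", 80, 8), ("8x MI300x", 192, 8)):
    free = gpu_gb * n - weights_gb  # ignores activations and runtime overhead
    print(f"{name}: {free:.0f} GB free -> ~{free * 1e9 / kv_per_token / 1e6:.2f}M cacheable tokens")
```

The MI300x's larger per-GPU memory leaves several times more room for KV cache, which is the saturation effect the benchmark probes.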
Show HN: Trilogy – A Reusable, Composable SQL Experiment (trilogydata.dev)
This demo uses the popular TPC-DS dataset for decision-support database benchmarking. You can read more about Trilogy and this benchmark tool here.
Pushing AMD's Infinity Fabric to Its Limits (chipsandcheese.com)
I recently wrote code to test memory latency under load, seeking to reproduce data in various presentations with bandwidth on the X axis and latency on the Y axis. Ampere pretty much described how that was done during their Hot Chips 2024 presentation. To achieve the same results in a semi-automated fashion, I run a latency test thread while also running a variable number of threads that generate bandwidth load.
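The same methodology in miniature, sketched with Python processes standing in for the article's native threads (interpreter overhead swamps the absolute numbers; the point is the shape of the experiment, not the data): one worker pointer-chases a randomized chain as a latency proxy while a variable number of workers stream bulk copies to generate bandwidth load.

```python
import multiprocessing as mp
import random
import time

def latency_worker(q):
    n = 1 << 20
    order = list(range(n))
    random.shuffle(order)          # randomized chain defeats the prefetcher
    chain = [0] * n
    for i in range(n - 1):
        chain[order[i]] = order[i + 1]
    t0, p = time.perf_counter(), order[0]
    for _ in range(n):
        p = chain[p]               # dependent loads: one access at a time
    q.put((time.perf_counter() - t0) / n * 1e9)

def bandwidth_worker(stop):
    src = bytearray(64 * 1024 * 1024)
    while not stop.is_set():
        _ = bytes(src)             # bulk copies to soak memory bandwidth

if __name__ == "__main__":
    for nload in (0, 2, 4):        # variable bandwidth load, as in the article
        stop, q = mp.Event(), mp.Queue()
        loads = [mp.Process(target=bandwidth_worker, args=(stop,)) for _ in range(nload)]
        for p in loads:
            p.start()
        mp.Process(target=latency_worker, args=(q,)).start()
        print(f"{nload} load workers: {q.get():.0f} ns/access")
        stop.set()
        for p in loads:
            p.join()
```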
Reliably Benchmarking Small Changes – Ankush Menat (ankush.dev)
I often have to benchmark web services to see if some small change has a meaningful impact on performance. Typically, you spawn a few web service workers and use another program (ideally on another machine in the same network) to hammer that service. During this time, the test program will keep track of how many requests were processed and the latency for each of them. If throughput goes up and/or latency goes down, your change was effective.
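A minimal version of that setup (the URL, request count, and concurrency below are placeholders, not from the post): worker threads hammer the service, per-request latency is recorded, and throughput plus percentiles are reported at the end.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/"  # hypothetical service under test
REQUESTS, WORKERS = 2_000, 16

def one_request(_):
    t0 = time.perf_counter()
    urllib.request.urlopen(URL).read()
    return time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    lat = sorted(pool.map(one_request, range(REQUESTS)))
elapsed = time.perf_counter() - t0

print(f"throughput: {REQUESTS / elapsed:,.0f} req/s")
print(f"p50={lat[len(lat) // 2] * 1e3:.1f} ms  p99={lat[int(len(lat) * 0.99)] * 1e3:.1f} ms")
```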
DeepSeek-R1-Lite-Preview is live: o1-preview-level performance on AIME and MATH (twitter.com)
Hyperfine: A command-line benchmarking tool (github.com/sharkdp)
I got WireGuard to hit 8 Gbps in tests, outperforming legacy solutions by 20x (netmaker.io)
Netmaker is a VPN that relies on WireGuard to forge fast, secure connections between devices and networks. WireGuard has demonstrated superior performance in industry speed tests, and so we wanted to run our own tests to determine how Netmaker performs against pure WireGuard, as well as other standard VPN alternatives.
New secret math benchmark stumps AI models and PhDs alike (arstechnica.com)
On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI.
UserBenchmark suggests you buy the i5-13600K over the Ryzen 7 9800X3D (tomshardware.com)
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI (epochai.org)
FrontierMath presents hundreds of unpublished, expert-level mathematics problems that specialists spend days solving. It offers an ongoing measure of AI progress on complex mathematical reasoning.
RD-TableBench – Accurately evaluating table extraction (reducto.ai)
RD-TableBench is an open benchmark to help teams evaluate extraction performance for complex tables.
Early Apple M4 Pro and M4 Max benchmarks hint at a performance boost (neowin.net)
After months of swirling rumors, Apple revealed its new Mac devices with updated M4 chips last week.