Hacker News with Generative AI: Benchmarking

What's the best AV1 encoder in 2025? I encoded four thousand GIFs to find out (catskull.net)
Datadog open-sources a SOTA time series model and a 350M-point benchmark (datadoghq.com)
We are excited to announce a new open-weights release of Toto, our state-of-the-art time series foundation model (TSFM), and BOOM, a new public observability benchmark that contains 350 million observations across 2,807 real-world time series.
New #1 open-source AI Agent on SWE-bench Verified (refact.ai)
Refact.ai Agent achieved 69.8% on SWE-bench Verified — autonomously solving 349 out of 500 tasks. This makes Refact.ai a leading open-source AI programming Agent on SWE-bench and places it among the top ranks on the leaderboard.
Terminal-Bench: a benchmark for AI agents in terminal environments (tbench.ai)
terminal-bench is a collection of tasks and an evaluation harness to help agent makers quantify their agents' terminal mastery.
The fastest Postgres inserts (hatchet.run)
At Hatchet, we spent the past half year running hundreds of benchmarks against different Postgres configurations. We set out with a simple question: at what scale does Postgres break?
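The main lever such insert benchmarks compare is batching: one statement per row versus one statement for many rows. A minimal sketch of that comparison, using Python's bundled sqlite3 as a stand-in (the Hatchet benchmarks target Postgres, where the batched path would be a multi-row VALUES list or COPY; the table name and row counts here are made up):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(i, f"task-{i}") for i in range(100_000)]

# One INSERT per row: a statement (and, in a real server, a round trip) each time.
start = time.perf_counter()
for row in rows[:10_000]:
    conn.execute("INSERT INTO tasks VALUES (?, ?)", row)
conn.commit()
per_row = time.perf_counter() - start

# Batched: one executemany call over the same data.
conn.execute("DELETE FROM tasks")
start = time.perf_counter()
conn.executemany("INSERT INTO tasks VALUES (?, ?)", rows)
conn.commit()
batched = time.perf_counter() - start

count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
print(count, per_row, batched)
```

This only illustrates the shape of the measurement; absolute numbers from an in-memory SQLite database say nothing about where Postgres breaks.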
Which LLM writes the best analytical SQL? (tinybird.co)
We asked 19 popular LLMs (+1 human) to write analytical SQL queries to filter and aggregate a 200 million row dataset. The result is the first version of the LLM SQL Generation Benchmark.
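The query shape being graded is filter-then-aggregate over a large table. A toy example of that shape, run against sqlite3 so it is self-contained (the benchmark itself runs against a 200-million-row dataset; this schema and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 100, "buy" if i % 3 == 0 else "view", i * 0.5) for i in range(3_000)],
)

# An "analytical" query: filter, group, aggregate, order, limit.
top = conn.execute(
    """
    SELECT user_id, COUNT(*) AS buys, SUM(amount) AS total
    FROM events
    WHERE action = 'buy'
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 5
    """
).fetchall()
print(len(top))
```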
CachyOS, Clear Linux and Debian Deliver the Best Performance on Framework Laptop (phoronix.com)
Debian 13 Testing did pick up a few wins for the OpenSSL benchmarks.
Beating the fastest lexer generator in Rust (alic.dev)
I was recently made aware of a crate for writing efficient lexers in Rust called logos.
Ubuntu 25.04 Advancing Performance of System76 Thelio Astra with Ampere Altra (phoronix.com)
With the release of Ubuntu 25.04 this month, I've looked at its performance on x86_64 laptop and desktop hardware and seen nice gains on servers as well. That testing so far was focused on Intel and AMD systems given my abundance of x86_64 platforms. Last week I began testing Ubuntu 25.04 ARM64 on the System76 Thelio Astra powered by Ampere Altra processors.
21 GB/s CSV Parsing Using SIMD on AMD 9950X (nietras.com)
Sep 0.10.0 was released April 22nd, 2025 with optimizations for AVX-512 capable CPUs like the AMD 9950X (Zen 5) and updated benchmarks including the 9950X. Sep now achieves a staggering 21 GB/s on the 9950X for the low-level CSV parsing. 🚀 Before 0.10.0, Sep achieved ~18 GB/s on 9950X.
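Sep's headline figure is bytes parsed per second. As a rough illustration of how such a number is computed (this is not Sep, and Python's csv module is nowhere near SIMD speeds; the data here is synthetic):

```python
import csv
import io
import time

# Synthetic CSV: 200,000 rows x 5 numeric columns.
data = "\n".join(",".join(str(i * 5 + j) for j in range(5)) for i in range(200_000))
nbytes = len(data.encode())

start = time.perf_counter()
rows = sum(1 for _ in csv.reader(io.StringIO(data)))
elapsed = time.perf_counter() - start

# Throughput = bytes parsed / wall-clock seconds.
print(rows, f"{nbytes / elapsed / 1e6:.1f} MB/s")
```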
LoCoDiff: Natural Long Context Code Benchmark (abanteai.github.io)
LoCoDiff is a novel long-context benchmark with several unique strengths:
CMU TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (arxiv.org)
To measure LLM agents' progress on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to a digital worker: browsing the web, writing code, running programs, and communicating with coworkers.
What went into training DeepSeek-R1? (epoch.ai)
On January 20th, 2025, DeepSeek released their latest open-weights reasoning model, DeepSeek-R1, which is on par with OpenAI’s o1 in benchmark performance.
Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arxiv.org)
While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons.
Optimizing Heap Allocations in Go: A Case Study (dolthub.com)
Last month, a commit that was supposed to be a no-op refactor caused a 30% regression in sysbench's types_scan benchmark.
OpenBSD IO Benchmarking: How Many Jobs Are Worth It? (rsadowski.de)
This post explores these questions through detailed fio(1) benchmarking, looking at random reads, random writes, and latency — all running on a recent build of OpenBSD 7.7-current.
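The "how many jobs" question maps directly onto fio's numjobs parameter. A hypothetical job file sketching that kind of sweep (parameter values invented; the post's actual configuration may differ):

```ini
; hypothetical fio job file for a random-read sweep
[global]
ioengine=psync        ; portable synchronous engine
bs=4k
size=1g
runtime=60
time_based
group_reporting

[randread]
rw=randread
numjobs=4             ; the variable under test: rerun with 1, 2, 4, 8, ...
```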
Can LLMs earn $1M from real freelance coding work? (getdx.com)
A new benchmark tests AI’s ability to complete real-world software engineering tasks.
We benchmarked GPT-4.1: it's better at code reviews than Claude Sonnet 3.7 (qodo.ai)
As AI coding assistants continue to evolve, one of the most relevant questions today is: which model provides the most helpful, precise, and actionable feedback for developers?
Significant performance improvements with Edge 134 (windows.com)
We’re very proud to say that, starting with version 134, Microsoft Edge is up to 9% faster as measured by the Speedometer 3.0 benchmark.
Nvidia's new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size (venturebeat.com)
Even as Meta fends off questions and criticism of its new Llama 4 model family, GPU giant Nvidia has released a new, fully open-source large language model (LLM) based on Meta's older Llama-3.1-405B-Instruct, claiming near-top performance on a variety of third-party benchmarks and outperforming the vaunted rival open-source reasoning model DeepSeek R1.
LLM Benchmark for 'Longform Creative Writing' (eqbench.com)
An LLM-judged longform creative writing benchmark (v3).
Fastify is 7x Faster than Next.js (jonasgalvez.com.br)
Fastify + React is 7x faster than Next.js (April 9, 2025)
Nvidia Just Released Llama Nemotron Ultra (ycombinator.com)
NVIDIA just released Llama 3.1 Nemotron Ultra (253B parameter model) that’s showing great performance on GPQA-Diamond, AIME, and LiveCodeBench.
Meta got caught gaming AI benchmarks (theverge.com)
With Llama 4, Meta fudged benchmarks to appear as though its new AI model is better than the competition.
New #1 SOTA on SWE-bench is using Claude 3.7 and o1 (swebench.com)
SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically.
Meta got caught gaming LMArena (theverge.com)
With Llama 4, Meta fudged benchmarks to appear as though its new AI model is better than the competition.
Serving Vector Tiles, Fast (spatialists.ch)
Want to serve #VectorTiles to your users? Fabian Rechsteiner’s benchmark pits six open-source servers (#BBOX, #ldproxy, #Martin, #pg_tileserv, #Tegola, #TiPg) against each other, revealing stark speed differences.
LocalScore: A Local LLM Benchmark (localscore.ai)
Today, I'm excited to announce LocalScore – an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results.
AMD Ryzen 9 9900X3D Impact of the 3D V-Cache Optimizer Linux Driver Review (phoronix.com)
Last month I posted benchmarks showing the performance when using the new 3D V-Cache Optimizer driver on Linux using the flagship Ryzen 9 9950X3D. This optimizer driver allows tuning the "amd_x3d_mode" for indicating your preference for the CCD with the higher frequency or larger cache size. For some additional insight into the 3D V-Cache Optimizer driver performance impact on Linux, here are benchmarks looking at the difference while using the AMD Ryzen 9 9900X3D.
Show HN: LocalScore – Local LLM Benchmark (localscore.ai)
There are two ways to run LocalScore. The easiest way to get started is to download one of the Official Models. If you already have .gguf models, you can run LocalScore with them.