Hacker News with Generative AI: Benchmarking

A Clang regression related to switch statements and inlining (nicula.xyz)
After my previous post, Eliminating redundant bound checks (read it for context if you haven’t already), I wanted to do a benchmark using the ‘optimized’ version of the increment() function, which didn’t contain any bound checks when compiled with Clang, even though we used .at() for indexing into the array.
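For context, the pattern under discussion looks roughly like this (a minimal sketch with assumed names and types, not the post's exact code): when the index type can only hold in-bounds values, Clang can prove that .at() never throws and compile it down to an unchecked access.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch: `idx` is a uint8_t, so it can only hold 0..255,
// and the array has exactly 256 elements. Clang can therefore prove
// that .at() never throws and eliminate the bounds check entirely.
void increment(std::array<uint64_t, 256>& counts, uint8_t idx) {
    counts.at(idx) += 1;  // compiles to a plain indexed add, no check
}
```

Whether the check actually disappears depends on optimization level and, per the post, on the Clang version, which is what makes the regression measurable in a benchmark.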
SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork (arxiv.org)
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts.
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges (arxiv.org)
As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers.
ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs (arxiv.org)
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals.
Benchmarking vision-language models on OCR in dynamic video environments (arxiv.org)
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments.
ASTRA: HackerRank's coding benchmark for LLMs (hackerrank.com)
HackerRank’s ASTRA benchmark is composed of multi-file, project-based problems designed to closely mimic real-world coding tasks.
Lzbench compression benchmark (morotti.github.io)
lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors.
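Here "in-memory" means each codec compresses a buffer that already resides in RAM, so disk I/O never enters the measurement. A minimal sketch of that kind of measurement, using zlib as a stand-in codec (an illustration, not lzbench's actual harness):

```cpp
#include <zlib.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Synthetic, mildly compressible input held entirely in memory.
    std::vector<unsigned char> src(64 * 1024 * 1024);
    for (size_t i = 0; i < src.size(); i++) src[i] = (unsigned char)(i % 251);

    std::vector<unsigned char> dst(compressBound(src.size()));
    uLongf dstLen = dst.size();

    auto t0 = std::chrono::steady_clock::now();
    int rc = compress2(dst.data(), &dstLen, src.data(), src.size(), 1);
    auto t1 = std::chrono::steady_clock::now();
    if (rc != Z_OK) return 1;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("ratio %.2fx, %.0f MB/s\n",
                (double)src.size() / dstLen,
                src.size() / secs / 1e6);
}
```

Build with `g++ bench.cc -O2 -lz`. lzbench applies the same idea across dozens of codecs.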
LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
This benchmark evaluates large language models (LLMs) based on how frequently they produce non-existent answers (confabulations or hallucinations) in response to misleading questions that are based on provided text documents.
Show HN: OLake [open source] – fastest database-to-Iceberg data replication tool (ycombinator.com)
Hi HN! Today we’re excited to introduce OLake (github.com/datazip-inc/olake, 130+ and growing fast), an open-source tool built to help you replicate database data (MongoDB for now; MySQL and Postgres are under development) into a data lakehouse quickly, without the hassle of managing Debezium or Kafka. It is at least 10x faster than Airbyte and Fivetran at a fraction of the cost; see the docs for benchmarks: https://olake.io/docs/connectors/mongodb/benchmarks.
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models (arxiv.org)
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge.
Run Deepseek from fast NVMe drives (github.com/BlinkDL)
Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.
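Serving a huge model from NVMe is bandwidth-bound at every tier, so a natural first step is measuring each tier's throughput. A rough sketch of the DRAM leg of such a measurement (my own illustration, not the repository's code):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 1 GiB buffer: far larger than any CPU cache, so reads stream from DRAM.
    std::vector<uint64_t> buf(1ull << 27, 1);

    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint64_t x : buf) sum += x;   // sequential streaming read
    auto t1 = std::chrono::steady_clock::now();

    volatile uint64_t sink = sum;      // keep the loop from being optimized out
    (void)sink;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("~%.1f GB/s sequential DRAM read\n",
                buf.size() * sizeof(uint64_t) / secs / 1e9);
}
```

The same timing pattern applies to the SSD and GPU tiers, with reads issued against the device instead of a heap buffer.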
Humanity's Last Exam (safe.ai)
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
I compared my daughter against SOTA models on math puzzles (michalprzadka.com)
I created an AI math reasoning benchmark using puzzles from this year’s GMIL competition — a long-running international mathematical challenge that I participated in myself back in 1998. The results are quite interesting: some of the most advanced AI models performed comparably to my 11-year-old daughter, while others struggled significantly. This experiment gives some amusing insights into current AI capabilities in mathematical reasoning, especially when compared to human performance at the middle school level.
A RISC-V Progress Check: Benchmarking P550 and C910 (chipsandcheese.com)
RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores; Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.
Results of "Humanity's Last Exam" benchmark published (scale.com)
Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity’s Last Exam, a groundbreaking new AI benchmark that was designed to test the limits of AI knowledge at the frontiers of human expertise.
Humanity's Last Exam (lastexam.ai)
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
Some Lessons from the OpenAI FrontierMath Debacle (lesswrong.com)
Recently, OpenAI announced their newest model, o3, achieving massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark of hard, unseen math problems of which previous models could solve only 2%. The events afterward revealed that the announcements were, perhaps unwittingly, not completely transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.
DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks (huggingface.co)
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable reasoning performance.
OpenAI funded independent math benchmark before setting record with o3 (the-decoder.com)
OpenAI's involvement in funding FrontierMath, a leading AI math benchmark, only came to light when the company announced its record-breaking performance on the test. Now, the benchmark's developer Epoch AI acknowledges they should have been more transparent about the relationship.
Boosting Computational Fluid Dynamics Performance with AMD MI300X (blogs.amd.com)
This blog will guide you, step by step, through installing and running benchmarks with Ansys Fluent on the AMD MI300X. We start with an overview of the Ansys Fluent CFD application and then show you how to set up an AMD MI300X system to run benchmarks. The benchmark results demonstrate the dramatic impact the MI300X has on speeding up simulations, improving design efficiency, and reducing costs in the automotive, aerospace, and environmental engineering industries.
The Two Word Test as a semantic benchmark for large language models (nature.com)
Large language models (LLMs) have shown remarkable abilities recently, including passing advanced professional exams and demanding benchmark tests.
Understanding JVM Garbage Collector Performance (mill-build.org)
Garbage collectors are a core part of many programming languages. While they generally work well, when they do go wrong they can fail in very unintuitive ways. This article discusses the fundamental design of garbage collectors and ties it to real benchmarks of how GCs perform on the Java Virtual Machine.
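The article's examples are JVM-specific, but the mechanism it builds on is language-agnostic. As a toy illustration of the classic design (my sketch, not the article's code), a mark-and-sweep collector works in two phases: mark everything reachable from the roots, then free everything left unmarked.

```cpp
#include <memory>
#include <vector>

struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;   // outgoing references to other objects
};

struct Heap {
    std::vector<std::unique_ptr<Obj>> objects;  // everything ever allocated
    std::vector<Obj*> roots;                    // stacks/globals in a real runtime

    // Mark phase: flag every object reachable from the roots.
    static void mark(Obj* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        for (Obj* r : o->refs) mark(r);
    }

    // Sweep phase: destroy unmarked objects, then reset marks on survivors.
    void collect() {
        for (Obj* r : roots) mark(r);
        std::erase_if(objects, [](const std::unique_ptr<Obj>& o) {
            return !o->marked;              // unreachable: reclaim
        });
        for (auto& o : objects) o->marked = false;
    }
};
```

Real collectors like the JVM's G1 or ZGC layer generations, concurrency, and compaction on top of this basic scheme, which is where the unintuitive behaviors the article benchmarks tend to come from.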
SOTA on swebench-verified: relearning the bitter lesson (aide.dev)
SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically.
Notes on the New Deepseek v3 (composio.dev)
Deepseek released their flagship model, v3, a 671B mixture-of-experts model with 37B active parameters. Currently it is the best open-source model, beating Llama 3.1 405B, Qwen, and Mistral. According to the benchmarks, it is on par with OpenAI's GPT-4o and Claude 3.5 Sonnet, and at some tasks it performs better than the big closed models.
30% drop in o1-preview accuracy when Putnam problems are slightly varied (openreview.net)
As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated.
Benchmarking RSA Key Generation (filippo.io)
RSA key generation is both conceptually simple and one of the worst implementation tasks in the field of cryptography engineering. Even benchmarking it is tricky, and involves some math: here’s how we generated a stable but representative “average case” instead of using the ordinary statistical approach.
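The trickiness comes from the prime search: a random odd 1024-bit candidate is prime with probability roughly 2/ln(2^1024), about 1 in 355, so the number of candidates tried per key is geometrically distributed and keygen latency is heavy-tailed. A toy simulation of that effect (my illustration of the problem, not the post's technique):

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Failures before each prime is found: geometric with p ~ 1/355
    // (the approximate density of primes among odd 1024-bit numbers).
    std::mt19937_64 rng(42);
    std::geometric_distribution<long> failures(1.0 / 355.0);

    std::vector<long> cost;  // primality tests per simulated 2048-bit key
    for (int i = 0; i < 10000; i++)
        cost.push_back(failures(rng) + failures(rng) + 2);  // two primes per key

    std::sort(cost.begin(), cost.end());
    long sum = 0;
    for (long c : cost) sum += c;
    std::printf("mean %ld, median %ld, p99 %ld primality tests per key\n",
                sum / (long)cost.size(), cost[cost.size() / 2],
                cost[(size_t)(cost.size() * 0.99)]);
}
```

The mean and the tail diverge badly, which is why the post constructs a stable, representative average case rather than sampling and averaging noisy runs.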
Show HN: Made a small JavaScript benchmarking app – BenchJS (benchjs.com)
Intel's Linux Performance Optimizations Continue Paying Off for AMD EPYC (phoronix.com)
As part of my end-of-year benchmarking and various historical comparisons, over the holidays I was curious to take a look at how the performance of the now-mature AMD EPYC 9004 "Genoa" has evolved over the past two years under Linux.
Reflecting on o3 "beating ARC": are we reliving the ImageNet 2012 moment again? (ycombinator.com)
AlexNet came along and blew everything out of the water. Then you can reflect on how much progress there has been (a lot) from 2012 until now, just on this little dataset. ARC is a much harder dataset; I don't even want to compare them. And now o3 has beaten it. So how much progress will there be from just this? The next 10 years are gonna be bonkers.
The Performance Benefits of Linux 6.12 LTS over Linux 6.6 LTS (phoronix.com)
Linux 6.12, the last major kernel release of 2024, was recently promoted to this year's Long Term Support (LTS) kernel. For enterprise Linux users, hyperscalers, and others who typically jump from one annual LTS kernel to the next, this holiday article offers benchmarks comparing the performance of Linux 6.12 LTS against Linux 6.6 LTS on an AMD Ryzen Threadripper workstation.