Hacker News with Generative AI: Benchmarking

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arxiv.org)
While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons.
Optimizing Heap Allocations in Go: A Case Study (dolthub.com)
Last month, a commit that was supposed to be a no-op refactor caused a 30% regression in sysbench's types_scan benchmark.
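The Dolt post's actual diff isn't reproduced here, but a minimal Go sketch (types and names invented for illustration, not from Dolt's code) shows how a refactor that looks behavior-preserving can change the compiler's escape analysis and start allocating on the heap:

    package main

    // A seemingly no-op refactor can change what the Go compiler's escape
    // analysis proves. Returning the struct by value lets it live on the
    // stack; extracting a constructor that returns a pointer (a common
    // "cleanup") forces a heap allocation on every call.

    type row struct {
    	vals [16]int64
    }

    // Before: r never escapes, so no allocation.
    func sumByValue() int64 {
    	var r row
    	for i := range r.vals {
    		r.vals[i] = int64(i)
    	}
    	var s int64
    	for _, v := range r.vals {
    		s += v
    	}
    	return s
    }

    // After: the returned pointer escapes newRow, so each call allocates.
    // (noinline keeps the example deterministic; verify with
    // `go build -gcflags=-m`.)
    //
    //go:noinline
    func newRow() *row {
    	r := &row{}
    	for i := range r.vals {
    		r.vals[i] = int64(i)
    	}
    	return r
    }

    func sumByPointer() int64 {
    	r := newRow()
    	var s int64
    	for _, v := range r.vals {
    		s += v
    	}
    	return s
    }

    func main() {
    	_ = sumByValue()
    	_ = sumByPointer()
    }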
OpenBSD IO Benchmarking: How Many Jobs Are Worth It? (rsadowski.de)
This post explores these questions through detailed fio(1) benchmarking, looking at random reads, random writes, and latency — all running on a recent build of OpenBSD 7.7-current.
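The article's exact job files aren't reproduced here; a minimal fio(1) job of the kind such a sweep might use (every parameter below is illustrative, not taken from the post) looks roughly like this:

    # Hypothetical random-read job: save as randread.fio, run `fio randread.fio`,
    # then repeat while varying numjobs to see how many jobs are worth it.
    [global]
    # psync (pread/pwrite) is the portable engine; OpenBSD has no Linux libaio
    ioengine=psync
    bs=4k
    size=1g
    runtime=60
    time_based

    [randread]
    rw=randread
    numjobs=4
    group_reporting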
Can LLMs earn $1M from real freelance coding work? (getdx.com)
A new benchmark tests AI’s ability to complete real-world software engineering tasks.
We benchmarked GPT-4.1: it's better at code reviews than Claude Sonnet 3.7 (qodo.ai)
As AI coding assistants continue to evolve, one of the most relevant questions today is: which model provides the most helpful, precise, and actionable feedback for developers?
Significant performance improvements with Edge 134 (windows.com)
We’re very proud to say that, starting with version 134, Microsoft Edge is up to 9% faster as measured by the Speedometer 3.0 benchmark.
Nvidia's new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size (venturebeat.com)
Even as Meta fends off questions and criticisms of its new Llama 4 model family, GPU powerhouse Nvidia has released a new, fully open source large language model (LLM) based on Meta's older Llama-3.1-405B-Instruct model, claiming near-top performance on a variety of third-party benchmarks and outperforming the vaunted rival DeepSeek R1 open source reasoning model.
LLM Benchmark for 'Longform Creative Writing' (eqbench.com)
An LLM-judged longform creative writing benchmark (v3).
Fastify is 7x Faster than Next.js (jonasgalvez.com.br)
Fastify + React is 7x faster than Next.js. (April 9, 2025)
Nvidia Just Released Llama Nemotron Ultra (ycombinator.com)
NVIDIA just released Llama 3.1 Nemotron Ultra (253B parameter model) that’s showing great performance on GPQA-Diamond, AIME, and LiveCodeBench.
Meta got caught gaming AI benchmarks (theverge.com)
With Llama 4, Meta fudged benchmarks to appear as though its new AI model is better than the competition.
New #1 SOTA on SWE-bench is using Claude 3.7 and o1 (swebench.com)
SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically.
Serving Vector Tiles, Fast (spatialists.ch)
Want to serve vector tiles to your users? Fabian Rechsteiner's benchmark pits six open-source servers (BBOX, ldproxy, Martin, pg_tileserv, Tegola, TiPg) against each other, revealing stark speed differences.
LocalScore: A Local LLM Benchmark (localscore.ai)
Today, I'm excited to announce LocalScore – an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results.
AMD Ryzen 9 9900X3D Impact of the 3D V-Cache Optimizer Linux Driver Review (phoronix.com)
Last month I posted benchmarks showing the performance when using the new 3D V-Cache Optimizer driver on Linux using the flagship Ryzen 9 9950X3D. This optimizer driver allows tuning the "amd_x3d_mode" for indicating your preference for the CCD with the higher frequency or larger cache size. For some additional insight into the 3D V-Cache Optimizer driver performance impact on Linux, here are benchmarks looking at the difference while using the AMD Ryzen 9 9900X3D.
Show HN: LocalScore – Local LLM Benchmark (localscore.ai)
There are two ways to run LocalScore. The easiest way to get started is to download one of the Official Models. If you already have .gguf models, you can run LocalScore with them.
Show HN: Benchi – A benchmarking tool written in Go (github.com/ConduitIO)
Benchi is a minimal benchmarking framework designed to help you measure the performance of your applications and infrastructure. It leverages Docker to create isolated environments for running benchmarks and collecting metrics.
Show HN: Docsumo's OCR Benchmark Report – Surpassing Mistral and Landing AI (docsumo.com)
In the past month, the AI community witnessed the launch of two much-anticipated OCR solutions—Mistral OCR by the Mistral team (known for their LLMs) and Agentic Document Extraction by Landing AI, Andrew Ng's company. At Docsumo, we live and breathe Document AI. So when these releases hit the market, we couldn't resist putting them to the test.
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (arxiv.org)
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors.
Show HN: Qwen-2.5-32B is now the best open source OCR model (github.com/getomni-ai)
A benchmarking tool that compares the OCR and data extraction capabilities of different large multimodal models, such as gpt-4o, evaluating both text and JSON extraction accuracy. The goal is to publish a comprehensive benchmark of OCR accuracy across traditional OCR providers and multimodal language models. The evaluation dataset and methodologies are all open source, and we encourage expanding this benchmark to encompass additional providers.
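The repository's own scoring code isn't shown here, but as a sketch of what "JSON extraction accuracy" can mean, one simple approach is field-level accuracy: flatten the expected and predicted documents into path/value pairs and count matches. (A minimal Go sketch under that assumption; names and scoring choices are invented, not the project's actual metric.)

    package main

    import (
    	"encoding/json"
    	"fmt"
    )

    // flatten turns nested JSON into path->value pairs, e.g.
    // $.invoice.total -> "42", so documents can be compared field by field.
    func flatten(prefix string, v interface{}, out map[string]string) {
    	switch t := v.(type) {
    	case map[string]interface{}:
    		for k, child := range t {
    			flatten(prefix+"."+k, child, out)
    		}
    	case []interface{}:
    		for i, child := range t {
    			flatten(fmt.Sprintf("%s[%d]", prefix, i), child, out)
    		}
    	default:
    		out[prefix] = fmt.Sprintf("%v", t)
    	}
    }

    // accuracy returns the fraction of expected fields the prediction
    // reproduced exactly.
    func accuracy(expected, predicted []byte) (float64, error) {
    	var e, p interface{}
    	if err := json.Unmarshal(expected, &e); err != nil {
    		return 0, err
    	}
    	if err := json.Unmarshal(predicted, &p); err != nil {
    		return 0, err
    	}
    	em, pm := map[string]string{}, map[string]string{}
    	flatten("$", e, em)
    	flatten("$", p, pm)
    	if len(em) == 0 {
    		return 0, nil
    	}
    	hits := 0
    	for path, want := range em {
    		if pm[path] == want {
    			hits++
    		}
    	}
    	return float64(hits) / float64(len(em)), nil
    }

    func main() {
    	exp := []byte(`{"invoice":{"total":42,"currency":"USD"}}`)
    	got := []byte(`{"invoice":{"total":42,"currency":"EUR"}}`)
    	acc, _ := accuracy(exp, got)
    	fmt.Printf("field accuracy: %.2f\n", acc) // 0.50: total matches, currency doesn't
    }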
Dual RTX 5090 Beats $25,000 H100 in Real-World LLM Performance (hardware-corner.net)
AI enthusiasts looking for top-tier performance in local LLMs have long considered NVIDIA’s H100 to be the gold standard for inference, thanks to its high-bandwidth HBM3 memory and optimized tensor cores. However, recent benchmarks show that a dual RTX 5090 setup, while still pricey, outperforms the H100 in sustained output token generation, making it an ideal choice for those seeking the best possible performance for home use, especially for models up to 70B parameters.
New open-source benchmark for real-time analytics applications (github.com/timescale)
DeepSeek V3 0324 outpaces GPT 4.5 and Claude 3.7 in coding, other benchmarks (huggingface.co)
DeepSeek-V3-0324 demonstrates notable improvements over its predecessor, DeepSeek-V3, in several key aspects.
The Curious Case of Beam CPU Usage (2019) (stressgrid.com)
While benchmarking Go vs Elixir vs Node, we discovered that Elixir (running on the BEAM virtual machine) had much higher CPU usage than Go, and yet its responsiveness remained excellent. Some of our readers suggested that busy waiting may be responsible for this behavior.
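BEAM's schedulers are implemented in C, but the busy-waiting idea is easy to sketch, here illustratively in Go (not BEAM's actual scheduler): a worker that spins polling its queue before blocking burns CPU even when idle, yet picks up new work without the latency of waking a sleeping thread.

    package main

    import (
    	"fmt"
    	"runtime"
    	"time"
    )

    // worker spins for a bounded number of polls before parking on a
    // blocking receive. The spin shows up as CPU usage even when the
    // system is idle, but a job arriving during the spin is dispatched
    // immediately.
    func worker(jobs chan func()) {
    	const spinBudget = 10000
    	for {
    		var job func()
    		spun := 0
    		for spun < spinBudget {
    			select {
    			case job = <-jobs:
    			default:
    				spun++
    				runtime.Gosched() // burn CPU politely while polling
    				continue
    			}
    			break
    		}
    		if job == nil {
    			job = <-jobs // budget exhausted: block, CPU drops to ~0
    		}
    		job()
    	}
    }

    func main() {
    	jobs := make(chan func())
    	go worker(jobs)
    	done := make(chan struct{})
    	start := time.Now()
    	jobs <- func() { close(done) }
    	<-done
    	fmt.Println("dispatch latency:", time.Since(start))
    }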
DeepSeek-V3-0324 Crushes GPT-4.5 in Math and Code Benchmarks at 1/277 the Cost (deepseek.com)
DeepSeek V3 is now the highest scoring non-reasoning model (twitter.com)
Show HN: BenchFlow – run AI benchmarks as an API (github.com/benchflow-ai)
BenchFlow is an open-source benchmark hub and evaluation infrastructure for AI production and benchmark developers.
Command A (cohere.com)
Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency.
TypeScript-go is now a performance benchmark for the Go compiler (googlesource.com)