Hacker News with Generative AI: Benchmarking

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad (arxiv.org)
Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors.
Show HN: Qwen-2.5-32B is now the best open source OCR model (github.com/getomni-ai)
A benchmarking tool that compares the OCR and data-extraction capabilities of different large multimodal models such as gpt-4o, evaluating both text and JSON extraction accuracy. The goal is to publish a comprehensive benchmark of OCR accuracy across traditional OCR providers and multimodal language models. The evaluation dataset and methodology are all open source, and we encourage expanding the benchmark to any additional providers.
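As a rough illustration of what "JSON extraction accuracy" can mean, here is a minimal sketch (not the repo's actual scorer) that measures field-level exact-match accuracy between a ground-truth record and a model's extraction:

```python
# Hypothetical scorer, not taken from the getomni-ai repo: computes the
# fraction of ground-truth fields the model extracted with the exact value.
def json_extraction_accuracy(truth: dict, predicted: dict) -> float:
    """Return the share of ground-truth keys whose values match exactly."""
    if not truth:
        return 1.0
    correct = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return correct / len(truth)

truth = {"invoice_number": "INV-001", "total": "42.50", "currency": "USD"}
predicted = {"invoice_number": "INV-001", "total": "42.50", "currency": "EUR"}
print(json_extraction_accuracy(truth, predicted))  # 0.666... (2 of 3 fields)
```

Real harnesses typically also handle nested fields and fuzzier value matching, but the headline number usually boils down to a ratio like this.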
Dual RTX 5090 Beats $25,000 H100 in Real-World LLM Performance (hardware-corner.net)
AI enthusiasts looking for top-tier performance in local LLMs have long considered NVIDIA’s H100 to be the gold standard for inference, thanks to its high-bandwidth HBM3 memory and optimized tensor cores. However, recent benchmarks show that a dual RTX 5090 setup, while still pricey, outperforms the H100 in sustained output token generation, making it an ideal choice for those seeking the best possible performance for home use, especially for models up to 70B parameters.
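For context on the metric: "sustained output token generation" is usually reported as tokens per second over a long generation run. A toy sketch of such a probe, where generate_stream is a hypothetical stand-in for whatever local inference API is under test:

```python
import time

# Hypothetical throughput probe: `generate_stream` stands in for a real
# local inference API (llama.cpp bindings, vLLM, etc.).
def generate_stream(prompt: str, max_tokens: int):
    for i in range(max_tokens):
        time.sleep(0.02)   # placeholder "model" producing ~50 tok/s
        yield f"tok{i}"

def sustained_tokens_per_second(prompt: str, max_tokens: int = 256) -> float:
    start = time.perf_counter()
    count = sum(1 for _ in generate_stream(prompt, max_tokens))
    return count / (time.perf_counter() - start)

print(f"{sustained_tokens_per_second('Hello'):.1f} tok/s")
```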
New open-source benchmark for real-time analytics applications (github.com/timescale)
DeepSeek V3 0324 outpaces GPT 4.5 and Claude 3.7 in coding, other benchmarks (huggingface.co)
DeepSeek-V3-0324 demonstrates notable improvements over its predecessor, DeepSeek-V3, in several key aspects.
The Curious Case of Beam CPU Usage (2019) (stressgrid.com)
While benchmarking Go vs Elixir vs Node, we discovered that Elixir (running on the BEAM virtual machine) had much higher CPU usage than Go, and yet its responsiveness remained excellent. Some of our readers suggested that busy waiting may be responsible for this behavior.
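To illustrate the distinction (a Python sketch of the general concept, not BEAM internals): a busy-waiting thread consumes CPU time while it polls for work, whereas a blocking wait consumes almost none, even though both can react promptly once work arrives:

```python
import threading
import time

def busy_wait(flag: threading.Event) -> None:
    # Polls in a tight loop: CPU usage is high even though no work is done.
    while not flag.is_set():
        pass

def blocking_wait(flag: threading.Event) -> None:
    # Sleeps inside the kernel until signalled: CPU usage is near zero.
    flag.wait()

for waiter in (busy_wait, blocking_wait):
    flag = threading.Event()
    thread = threading.Thread(target=waiter, args=(flag,))
    start_cpu = time.process_time()
    thread.start()
    time.sleep(1.0)          # let the waiter "wait" for one second
    flag.set()
    thread.join()
    cpu = time.process_time() - start_cpu
    print(f"{waiter.__name__}: ~{cpu:.2f}s of CPU consumed while waiting")
```

High CPU from deliberate spinning can thus coexist with excellent responsiveness, which is the trade-off the article attributes to the BEAM's schedulers.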
DeepSeek-V3-0324 Crushes GPT-4.5 in Math and Code Benchmarks at 1/277 the Cost (deepseek.com)
DeepSeek V3 is now the highest scoring non-reasoning model (twitter.com)
Show HN: BenchFlow – run AI benchmarks as an API (github.com/benchflow-ai)
BenchFlow is an open-source benchmark hub and evaluation infrastructure for AI production and benchmark developers.
Command A (cohere.com)
Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency.
TypeScript-go is now a performance benchmark for the Go compiler (googlesource.com)
Java Is Fast, If You Don't Create Many Objects (2022) (vanillajava.blog)
This article looks at a benchmark that passes events over TCP/IP at 4 billion events per minute using the net.openhft.chronicle.wire.channel package in Chronicle Wire, and explains why we still avoid object allocations.
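The article's code is Java with Chronicle Wire, but the underlying point, that allocating in a hot loop costs throughput, can be sketched in any managed language. A rough Python illustration comparing a fresh object per event against one reused event:

```python
import time

EVENTS = 1_000_000

def with_allocation() -> int:
    total = 0
    for i in range(EVENTS):
        event = {"id": i, "payload": "x"}  # fresh object per event
        total += event["id"]
    return total

def with_reuse() -> int:
    total = 0
    event = {"id": 0, "payload": "x"}      # one object, mutated in place
    for i in range(EVENTS):
        event["id"] = i
        total += event["id"]
    return total

for fn in (with_allocation, with_reuse):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

In a JVM the gap is compounded by garbage-collection pressure, which is the cost the Chronicle authors design around.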
Llama.cpp AI Performance with the GeForce RTX 5090 Review (phoronix.com)
Following last week's CUDA/OpenCL/OptiX benchmarks that began our NVIDIA Blackwell Linux compute testing, a number of readers asked about AI performance with the GeForce RTX 5090, and in particular Llama.cpp performance with the flagship graphics card.
Evaluating Mistral OCR Against Gemini 2.0 Flash (reducto.ai)
Today, Mistral AI released a new OCR model, claiming to be state-of-the-art (SOTA) on unreleased benchmarks. We decided to put the model to the test.
Show HN: AI Browser Agent Leaderboard (steel.dev)
See how various AI browser agents stack up based on their accuracy in completing web-based tasks on the WebVoyager benchmark.
People are using Super Mario to benchmark AI now (techcrunch.com)
Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.
Turbocharging V8 with mutable heap numbers · V8 (v8.dev)
At V8, we're constantly striving to improve JavaScript performance. As part of this effort, we recently revisited the JetStream2 benchmark suite to eliminate performance cliffs.
New SOTA on SimpleQA (linkup.so)
Linkup has achieved state-of-the-art (SOTA) performance on OpenAI's SimpleQA benchmark, scoring 90.10%.
Show HN: Benchmarking VLMs vs. Traditional OCR (getomni.ai)
Are LLMs a total replacement for traditional OCR models? It's been an increasingly hot topic, especially with models like Gemini 2.0 becoming cost competitive with traditional OCR.
A Clang regression related to switch statements and inlining (nicula.xyz)
After my previous post, Eliminating redundant bound checks (read it for context if you haven’t already), I wanted to do a benchmark using the ‘optimized’ version of the increment() function, which didn’t contain any bound checks when compiled with Clang, even though we used .at() for indexing into the array.
SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork (arxiv.org)
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts.
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges (arxiv.org)
As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers.
ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs (arxiv.org)
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals.
Benchmarking vision-language models on OCR in dynamic video environments (arxiv.org)
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments.
ASTRA: HackerRank's coding benchmark for LLMs (hackerrank.com)
HackerRank’s ASTRA benchmark is composed of multi-file, project-based problems designed to closely mimic real-world coding tasks.
Lzbench compression benchmark (morotti.github.io)
lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors.
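To get a feel for what such a benchmark measures, here is a small Python sketch (standard-library codecs only, not lzbench itself) that reports in-memory compression throughput and ratio:

```python
import bz2
import lzma
import time
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 20000

# In-memory comparison in the spirit of lzbench: compress once per codec
# and report throughput and compression ratio.
for name, compress in (("zlib", zlib.compress),
                       ("lzma", lzma.compress),
                       ("bz2", bz2.compress)):
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    mb_per_s = len(data) / elapsed / 1e6
    ratio = len(data) / len(compressed)
    print(f"{name}: {mb_per_s:8.1f} MB/s, ratio {ratio:.1f}x")
```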
LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
This benchmark evaluates large language models (LLMs) based on how frequently they produce non-existent answers (confabulations or hallucinations) in response to misleading questions that are based on provided text documents.
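Conceptually, the score reduces to counting answers that should have been refusals. A toy sketch, where ask_model is a hypothetical stand-in for a real LLM call:

```python
# Toy confabulation scorer. The benchmark's misleading questions have no
# answer in the provided document, so any non-refusal counts against the model.
REFUSAL = "NOT IN DOCUMENT"

def ask_model(document: str, question: str) -> str:
    return REFUSAL  # a perfectly calibrated model would refuse every time

def confabulation_rate(document: str, misleading_questions: list[str]) -> float:
    answers = [ask_model(document, q) for q in misleading_questions]
    confabulated = sum(1 for a in answers if a != REFUSAL)
    return confabulated / len(misleading_questions)

doc = "The 2019 report covers revenue only; it never mentions headcount."
questions = ["What was the company's headcount in 2019?"]
print(confabulation_rate(doc, questions))  # 0.0 for this ideal model
```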
Show HN: OLake[open source] Fastest database to Iceberg data replication tool (ycombinator.com)
Hi HN, today we're excited to introduce OLake (github.com/datazip-inc/olake, 130+ and growing fast), an open-source tool built to help you replicate database data (MongoDB for now; MySQL and Postgres under development) into a data lakehouse at speed, without the hassle of managing Debezium or Kafka (at least 10x faster than Airbyte and Fivetran at a fraction of the cost; see the docs for benchmarks: https://olake.io/docs/connectors/mongodb/benchmarks).
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models (arxiv.org)
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge.
Run Deepseek from fast NVMe drives (github.com/BlinkDL)
Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.
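The storage half of such a benchmark can be approximated in a few lines; here is a crude sequential-read sketch (an illustration, not the linked repo's code):

```python
import os
import time

# Rough sequential-read benchmark: write a 1 GiB scratch file, then time
# reading it back in 8 MiB chunks. For honest numbers on a real run you
# would drop the OS page cache first, or the read may come from RAM.
PATH = "scratch.bin"
SIZE = 1 << 30        # 1 GiB
CHUNK = 8 << 20       # 8 MiB

block = os.urandom(CHUNK)
with open(PATH, "wb") as f:
    for _ in range(SIZE // CHUNK):
        f.write(block)

start = time.perf_counter()
read = 0
with open(PATH, "rb") as f:
    while chunk := f.read(CHUNK):
        read += len(chunk)
elapsed = time.perf_counter() - start
print(f"read {read / 1e9:.2f} GB in {elapsed:.2f}s "
      f"-> {read / elapsed / 1e9:.2f} GB/s")
os.remove(PATH)
```

Throughput numbers like this matter here because streaming a large model's weights from NVMe puts the drive, not the GPU, on the critical path.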