Hacker News with Generative AI: Benchmarking

DeepSeek-R1-Lite-Preview is live: o1-preview-level performance on AIME and MATH (twitter.com)
Hyperfine: A command-line benchmarking tool (github.com/sharkdp)
A command-line benchmarking tool.
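As a sketch of typical usage (the two commands being timed are arbitrary placeholders, not from the repo), hyperfine runs each command repeatedly and reports mean, deviation, and a relative comparison:

```shell
# Compare two ways of counting lines in a file; --warmup runs each command
# a few times first so filesystem caches don't skew the first measurement.
hyperfine --warmup 3 'wc -l /etc/passwd' 'grep -c "" /etc/passwd'
```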
I got WireGuard to hit 8 Gbps in tests, outperforming legacy solutions by 20x (netmaker.io)
Netmaker is a VPN that relies on WireGuard to forge fast, secure connections between devices and networks. WireGuard has demonstrated superior performance in industry speed tests, and so we wanted to run our own tests to determine how Netmaker performs against pure WireGuard, as well as other standard VPN alternatives.
New secret math benchmark stumps AI models and PhDs alike (arstechnica.com)
On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI.
UserBenchmark suggests you buy the i5-13600K over the Ryzen 7 9800X3D (tomshardware.com)
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI (epochai.org)
FrontierMath presents hundreds of unpublished, expert-level mathematics problems that specialists spend days solving. It offers an ongoing measure of AI progress on complex mathematical reasoning.
Rd-TableBench – Accurately evaluating table extraction (reducto.ai)
RD-TableBench is an open benchmark to help teams evaluate extraction performance for complex tables.
Early Apple M4 Pro and M4 Max benchmarks hint at a performance boost (neowin.net)
After months of swirling rumors, Apple revealed its new Mac devices with updated M4 chips last week.
Linus Torvalds Lands 2.6% Performance Improvement with Minor Linux Kernel Patch (phoronix.com)
Linus Torvalds on Wednesday merged a patch he authored that, by reworking a few lines of code, scores a 2.6% improvement in Intel's well-exercised "will it scale" per-thread-ops benchmark test case.
Benchmarking Ruby Parsers (eregon.me)
The new Prism parser has become the default in Ruby 3.4.0 preview 2.
Python 3.12 vs. Python 3.13 – performance testing (lewoniewski.info)
This article describes the performance testing results of Python 3.13 compared to Python 3.12. A total of 100 different benchmark tests were conducted on computers with AMD Ryzen 7000 series and 13th-generation Intel Core processors for desktops, laptops, and mini PCs.
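The same style of comparison can be reproduced locally with Python's stdlib timeit module — a minimal sketch, assuming both version-suffixed interpreter binaries are on your PATH (the binary names are assumptions; substitute your own builds):

```shell
# Run an identical micro-benchmark under each interpreter and compare the
# per-loop timings that timeit prints.
python3.12 -m timeit 'sum(range(10_000))'
python3.13 -m timeit 'sum(range(10_000))'
```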
AI PCs Aren't Good at AI: The CPU Beats the NPU (github.com/usefulsensors)
I’ve long been a fan of Qualcomm’s NPUs, and I even collaborated with them to get experimental support for the underlying HVX DSP into TensorFlow back in 2017 (traces remain here). That meant I was very excited when I heard they were bringing those same accelerators to Windows tablets, offering up to 45 trillion ops per second.
Nvidia Outperforms GPT-4o with Open Source Model (github.com/lmarena)
Arena-Hard-Auto-v0.1 (See Paper) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as judge to compare the models' responses against a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto has the highest correlation and separability to Chatbot Arena among popular open-ended LLM benchmarks (See Paper). If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
Open-source 70B model surpasses GPT-4o and Claude 3.5 on Arena Hard (huggingface.co)
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.
A benchmark of three different floating point packages for the 6809 (conman.org)
I recently came across another floating point package for the 6809 (written by Lennart Benschop) and I wanted to see how it stacked up against IEEE-754 and BASIC floating point math.
AMD EPYC 9965 Delivers Better Performance/Power Efficiency vs AmpereOne 192-Core (phoronix.com)
Complementing the AMD EPYC 9575F / 9755 / 9965 performance benchmarks article looking at those Turin processors up against prior AMD EPYC CPUs and the Intel Xeon competition, this article is looking squarely at the 192-core EPYC 9965 "Turin Dense" processor compared to Ampere Computing's AmpereOne A192-32X flagship processor.
Benchmarking Llama 3.1 405B on 8x AMD MI300X GPUs (dstack.ai)
At dstack, we've been adding support for AMD GPUs with SSH fleets, so we saw this as a great chance to test our integration by benchmarking AMD GPUs. Our friends at Hot Aisle, who build top-tier bare metal compute for AMD GPUs, kindly provided the hardware for the benchmark.
We Compared ScyllaDB and Memcached and We Lost (scylladb.com)
Engineers behind ScyllaDB – the database for predictable performance at scale – joined forces with Memcached maintainer dormando to compare both technologies head-to-head, in a collaborative vendor-neutral way.
AI Battle: Ranking LLMs by how well they chain multiple tools to solve tasks (scale.com)
To bridge this gap and better align the needs of AI applications with the capabilities of benchmarks, we introduce ToolComp, a tool-use benchmark designed to meet the evolving demands of agentic model makers seeking to rigorously test and scale their models in practical, dynamic environments.
An extensive benchmark of C and C++ hash tables (jacksonallan.github.io)
Although thorough hash-table benchmarks have already been published by others in recent years, I have decided to contribute another benchmarking suite for several reasons.
Show HN: Mitata – Benchmarking tooling for JavaScript (github.com/evanwashere)
benchmark tooling that loves you ❤️
Prem-SQL-1B First 1B SLM Rivals 4o, Outperforms Qwen-7B on BirdBench Private Set (github.com/premAI-io)
PremSQL is an open-source library designed to help developers create secure, fully local Text-to-SQL solutions using small language models.
Intel Xeon 6900P Reasserts Intel Server Leadership (servethehome.com)
Benchmarking the CLOS (djhaskin.com)
Large Text Compression Benchmark (mattmahoney.net)
This competition ranks lossless data compression programs by the compressed size (including the size of the decompression program) of the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006.
iPhone 16's A18 Pro chip outperforms the M1 chip (9to5mac.com)
We got our first look at a Geekbench result from the iPhone 16 earlier this week, with somewhat disappointing results.
OpenAI o1 Results on ARC-AGI-Pub (arcprize.org)
AGI progress has stalled. New ideas are needed.
Bombardier: Fast cross-platform HTTP benchmarking tool written in Go (github.com/codesenberg)
bombardier is an HTTP(S) benchmarking tool. It is written in the Go programming language and uses the excellent fasthttp package instead of Go's default net/http library because of its lightning-fast performance.
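A representative invocation (the target URL is a placeholder; -c, -d, and -l are bombardier's connection-count, duration, and latency-distribution flags):

```shell
# Hammer a local endpoint with 125 concurrent connections for 10 seconds,
# then print a latency distribution alongside the summary stats.
bombardier -c 125 -d 10s -l http://localhost:8080/
```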