Hacker News with Generative AI: Benchmarking

A Clang regression related to switch statements and inlining (nicula.xyz)
After my previous post, Eliminating redundant bound checks (read it for context if you haven’t already), I wanted to do a benchmark using the ‘optimized’ version of the increment() function, which didn’t contain any bound checks when compiled with Clang, even though we used .at() for indexing into the array.
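For context, the pattern under discussion looks roughly like this (a minimal sketch with assumed names and types, not the post's exact code): when the index type can only hold in-bounds values, Clang can prove that .at() never throws and compile it down to an unchecked access.

```cpp
#include <array>
#include <cstdint>

// Hypothetical sketch: `idx` is a uint8_t, so it can only hold 0..255,
// and the array has exactly 256 elements. Clang can therefore prove
// that .at() never throws and eliminate the bounds check entirely.
void increment(std::array<uint64_t, 256>& counts, uint8_t idx) {
    counts.at(idx) += 1;  // compiles to a plain indexed add, no check
}
```

Whether the check actually disappears depends on optimization level and, per the post, on the Clang version, which is what makes the regression measurable in a benchmark.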
SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork (arxiv.org)
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts.
EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges (arxiv.org)
As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers.
ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs (arxiv.org)
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals.
Benchmarking vision-language models on OCR in dynamic video environments (arxiv.org)
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments.
ASTRA: HackerRank's coding benchmark for LLMs (hackerrank.com)
HackerRank’s ASTRA benchmark is composed of multi-file, project-based problems designed to closely mimic real-world coding tasks.
Lzbench compression benchmark (morotti.github.io)
lzbench is an in-memory benchmark of open-source LZ77/LZSS/LZMA compressors.
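Here "in-memory" means each codec compresses a buffer that already resides in RAM, so disk I/O never enters the measurement. A minimal sketch of that kind of measurement, using zlib as a stand-in codec (an illustration, not lzbench's actual harness):

```cpp
#include <zlib.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Synthetic, mildly compressible input held entirely in memory.
    std::vector<unsigned char> src(64 * 1024 * 1024);
    for (size_t i = 0; i < src.size(); i++) src[i] = (unsigned char)(i % 251);

    std::vector<unsigned char> dst(compressBound(src.size()));
    uLongf dstLen = dst.size();

    auto t0 = std::chrono::steady_clock::now();
    int rc = compress2(dst.data(), &dstLen, src.data(), src.size(), 1);
    auto t1 = std::chrono::steady_clock::now();
    if (rc != Z_OK) return 1;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("ratio %.2fx, %.0f MB/s\n",
                (double)src.size() / dstLen,
                src.size() / secs / 1e6);
}
```

Build with `g++ bench.cc -O2 -lz`. lzbench applies the same idea across dozens of codecs.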
LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
This benchmark evaluates large language models (LLMs) based on how frequently they produce non-existent answers (confabulations or hallucinations) in response to misleading questions that are based on provided text documents.
Show HN: OLake [open source] – fastest database-to-Iceberg data replication tool (ycombinator.com)
Hi HN! Today we’re excited to introduce OLake (github.com/datazip-inc/olake, 130+ and growing fast), an open-source tool built to help you replicate database data (MongoDB for now; MySQL and Postgres are under development) into a data lakehouse quickly, without the hassle of managing Debezium or Kafka. It is at least 10x faster than Airbyte and Fivetran at a fraction of the cost; see the docs for benchmarks: https://olake.io/docs/connectors/mongodb/benchmarks.
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models (arxiv.org)
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge.
Run Deepseek from fast NVMe drives (github.com/BlinkDL)
Prepare for DeepSeek R1 inference: benchmark CPU, DRAM, SSD, iGPU, GPU, ... with efficient code.
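Serving a huge model from NVMe is bandwidth-bound at every tier, so a natural first step is measuring each tier's throughput. A rough sketch of the DRAM leg of such a measurement (my own illustration, not the repository's code):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 1 GiB buffer: far larger than any CPU cache, so reads stream from DRAM.
    std::vector<uint64_t> buf(1ull << 27, 1);

    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint64_t x : buf) sum += x;   // sequential streaming read
    auto t1 = std::chrono::steady_clock::now();

    volatile uint64_t sink = sum;      // keep the loop from being optimized out
    (void)sink;

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("~%.1f GB/s sequential DRAM read\n",
                buf.size() * sizeof(uint64_t) / secs / 1e9);
}
```

The same timing pattern applies to the SSD and GPU tiers, with reads issued against the device instead of a heap buffer.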
Humanity's Last Exam (safe.ai)
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
I compared my daughter against SOTA models on math puzzles (michalprzadka.com)
I created an AI math reasoning benchmark using puzzles from this year’s GMIL competition — a long-running international mathematical challenge that I participated in myself back in 1998. The results are quite interesting: some of the most advanced AI models performed comparably to my 11-year-old daughter, while others struggled significantly. This experiment gives some amusing insights into current AI capabilities in mathematical reasoning, especially when compared to human performance at the middle school level.
A RISC-V Progress Check: Benchmarking P550 and C910 (chipsandcheese.com)
RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores; Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.
Results of "Humanity's Last Exam" benchmark published (scale.com)
Scale AI and the Center for AI Safety (CAIS) are proud to publish the results of Humanity’s Last Exam, a groundbreaking new AI benchmark that was designed to test the limits of AI knowledge at the frontiers of human expertise.
Humanity's Last Exam (lastexam.ai)
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
Some Lessons from the OpenAI FrontierMath Debacle (lesswrong.com)
Recently, OpenAI announced their newest model, o3, achieving massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark of hard, unseen math problems of which previous models could solve only 2%. The events afterward revealed that the announcements were, perhaps unwittingly, not completely transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.
DeepSeek-R1-Distill-Qwen-1.5B Surpasses GPT-4o in certain benchmarks (huggingface.co)
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable reasoning performance.
OpenAI funded independent math benchmark before setting record with o3 (the-decoder.com)
OpenAI's involvement in funding FrontierMath, a leading AI math benchmark, only came to light when the company announced its record-breaking performance on the test. Now, the benchmark's developer Epoch AI acknowledges they should have been more transparent about the relationship.
Boosting Computational Fluid Dynamics Performance with AMD MI300X (blogs.amd.com)
This blog will guide you, step by step, through installing and running benchmarks with Ansys Fluent on the AMD MI300X. We start with an overview of the Ansys Fluent CFD application and then show you how to set up an AMD MI300X system to run benchmarks. The benchmark results demonstrate the dramatic impact the MI300X has on speeding up simulations, improving design efficiency, and reducing costs in the automotive, aerospace, and environmental engineering industries.
The Two Word Test as a semantic benchmark for large language models (nature.com)
Large language models (LLMs) have shown remarkable abilities recently, including passing advanced professional exams and demanding benchmark tests.
Understanding JVM Garbage Collector Performance (mill-build.org)
Garbage collectors are a core part of many programming languages. While they generally work well, when they do go wrong they can fail in very unintuitive ways. This article discusses the fundamental design of garbage collectors and ties it to real benchmarks of how GCs perform on the Java Virtual Machine.
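The article's examples are JVM-specific, but the mechanism it builds on is language-agnostic. As a toy illustration of the classic design (my sketch, not the article's code), a mark-and-sweep collector works in two phases: mark everything reachable from the roots, then free everything left unmarked.

```cpp
#include <memory>
#include <vector>

struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;   // outgoing references to other objects
};

struct Heap {
    std::vector<std::unique_ptr<Obj>> objects;  // everything ever allocated
    std::vector<Obj*> roots;                    // stacks/globals in a real runtime

    // Mark phase: flag every object reachable from the roots.
    static void mark(Obj* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        for (Obj* r : o->refs) mark(r);
    }

    // Sweep phase: destroy unmarked objects, then reset marks on survivors.
    void collect() {
        for (Obj* r : roots) mark(r);
        std::erase_if(objects, [](const std::unique_ptr<Obj>& o) {
            return !o->marked;              // unreachable: reclaim
        });
        for (auto& o : objects) o->marked = false;
    }
};
```

Real collectors like the JVM's G1 or ZGC layer generations, concurrency, and compaction on top of this basic scheme, which is where the unintuitive behaviors the article benchmarks tend to come from.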
SOTA on swebench-verified: relearning the bitter lesson (aide.dev)
SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically.
Notes on the New Deepseek v3 (composio.dev)
Deepseek released their flagship model, v3, a 671B mixture-of-experts model with 37B active parameters. Currently it is the best open-source model, beating Llama 3.1 405B, Qwen, and Mistral. According to the benchmarks, it is on par with OpenAI's GPT-4o and Claude 3.5 Sonnet, and at some tasks it performs better than the big closed models.
30% drop in o1-preview accuracy when Putnam problems are slightly varied (openreview.net)
As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated.
Benchmarking RSA Key Generation (filippo.io)
RSA key generation is both conceptually simple and one of the worst implementation tasks in the field of cryptography engineering. Even benchmarking it is tricky, and involves some math: here’s how we generated a stable but representative “average case” instead of using the ordinary statistical approach.
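The trickiness comes from the prime search: a random odd 1024-bit candidate is prime with probability roughly 2/ln(2^1024), about 1 in 355, so the number of candidates tried per key is geometrically distributed and keygen latency is heavy-tailed. A toy simulation of that effect (my illustration of the problem, not the post's technique):

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Failures before each prime is found: geometric with p ~ 1/355
    // (the approximate density of primes among odd 1024-bit numbers).
    std::mt19937_64 rng(42);
    std::geometric_distribution<long> failures(1.0 / 355.0);

    std::vector<long> cost;  // primality tests per simulated 2048-bit key
    for (int i = 0; i < 10000; i++)
        cost.push_back(failures(rng) + failures(rng) + 2);  // two primes per key

    std::sort(cost.begin(), cost.end());
    long sum = 0;
    for (long c : cost) sum += c;
    std::printf("mean %ld, median %ld, p99 %ld primality tests per key\n",
                sum / (long)cost.size(), cost[cost.size() / 2],
                cost[(size_t)(cost.size() * 0.99)]);
}
```

The mean and the tail diverge badly, which is why the post constructs a stable, representative average case rather than sampling and averaging noisy runs.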
Show HN: Made a small JavaScript benchmarking app – BenchJS (benchjs.com)
Intel's Linux Performance Optimizations Continue Paying Off for AMD EPYC (phoronix.com)
As part of my end-of-year benchmarking and various historical comparisons, over the holidays I was curious to take a look at how the performance of the now-mature AMD EPYC 9004 "Genoa" has evolved over the past two years under Linux.
Reflecting on o3 "beating ARC": are we reliving the ImageNet 2012 moment again? (ycombinator.com)
AlexNet came along and blew everything out of the water. Then you can reflect on how much progress there has been (a lot) from 2012 until now, just on this little dataset. ARC is a much harder dataset; I don't even want to compare them. And now o3 has beaten it. So how much progress will there be from just this? The next 10 years are gonna be bonkers.
The Performance Benefits of Linux 6.12 LTS over Linux 6.6 LTS (phoronix.com)
Linux 6.12, the last major kernel release of 2024, was recently promoted to this year's Long Term Support (LTS) kernel. For enterprise Linux users, hyperscalers, and others who typically jump from one annual LTS kernel to the next, this holiday article offers benchmarks comparing the performance of Linux 6.12 LTS against Linux 6.6 LTS on an AMD Ryzen Threadripper workstation.