New #1 open-source AI Agent on SWE-bench Verified(refact.ai) Refact.ai Agent achieved 69.8% on SWE-bench Verified — autonomously solving 349 out of 500 tasks. This makes Refact.ai a leading open-source AI programming Agent on SWE-bench and places it among the top ranks on the leaderboard.
The fastest Postgres inserts(hatchet.run) At Hatchet, we spent the past half year running hundreds of benchmarks against different Postgres configurations. We set out with a simple question: at what scale does Postgres break?
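Hatchet's benchmarks center on how you batch work into Postgres. As a minimal sketch of the idea (table and column names are invented for illustration, not from the post), here is the difference between issuing one round trip per row and collapsing a batch into a single multi-row `INSERT` or a `COPY ... FROM STDIN` payload, which is typically the fastest bulk path:

```python
def multi_row_insert(table, columns, rows):
    """Build one multi-row INSERT, cutting round trips from len(rows) to 1."""
    placeholders = ", ".join(
        "(" + ", ".join(["%s"] * len(columns)) + ")" for _ in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {placeholders};"


def copy_payload(rows):
    """Serialize rows in the tab-separated text format COPY FROM STDIN
    expects; COPY skips per-statement parsing and planning entirely."""
    return "\n".join("\t".join(map(str, r)) for r in rows) + "\n"


batch = [(1, "a"), (2, "b")]
sql = multi_row_insert("events", ["id", "payload"], batch)
data = copy_payload(batch)
```

This is only the statement-construction side; real throughput also depends on indexes, WAL settings, and connection pooling, which is what the benchmarks in the post vary.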
Which LLM writes the best analytical SQL?(tinybird.co) We asked 19 popular LLMs (+1 human) to write analytical SQL queries to filter and aggregate a 200 million row dataset. The result is the first version of the LLM SQL Generation Benchmark.
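The benchmark asks each model for filter-and-aggregate queries over a large dataset. A toy version of that task, using an in-memory SQLite table with invented column names (the real benchmark runs against a 200 million row dataset), shows the shape of query the LLMs had to produce:

```python
import sqlite3

# Tiny stand-in for the benchmark dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("US", 10.0), ("US", 30.0), ("DE", 5.0), ("DE", 15.0), ("FR", 2.0)],
)

# The kind of analytical SQL under test: filter, group, aggregate, order.
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS n, SUM(amount) AS total
    FROM events
    WHERE amount > 3
    GROUP BY country
    ORDER BY total DESC
    """
).fetchall()
# rows -> [('US', 2, 40.0), ('DE', 2, 20.0)]
```

Grading then comes down to whether a generated query is both correct (right rows, right aggregates) and efficient at the 200-million-row scale.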
Ubuntu 25.04 Advancing Performance of System76 Thelio Astra with Ampere Altra(phoronix.com) With the release of Ubuntu 25.04 this month I've looked at its performance on x86_64 laptop, desktop, and server hardware, finding nice gains. That testing so far was focused on Intel and AMD systems given my abundance of x86_64 platforms. Last week I began testing Ubuntu 25.04 ARM64 on the System76 Thelio Astra powered by Ampere Altra processors.
21 GB/s CSV Parsing Using SIMD on AMD 9950X(nietras.com) Sep 0.10.0 was released April 22nd, 2025 with optimizations for AVX-512 capable CPUs like the AMD 9950X (Zen 5) and updated benchmarks including the 9950X. Sep now achieves a staggering 21 GB/s on the 9950X for the low-level CSV parsing. 🚀 Before 0.10.0, Sep achieved ~18 GB/s on 9950X.
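The core operation Sep accelerates with AVX-512 is scanning a buffer for delimiter bytes and extracting their positions as a bitmask, one bit per byte lane. A scalar Python sketch of that inner loop (illustrative only; Sep itself is a C#/.NET library and does this on whole 64-byte registers at once):

```python
def delim_mask(chunk: bytes) -> int:
    """Bitmask of ',' and '\\n' positions in one chunk: the shape of the
    result an AVX-512 byte-compare produces in a single instruction."""
    mask = 0
    for i, b in enumerate(chunk):
        if b in (0x2C, 0x0A):  # ',' or '\n'
            mask |= 1 << i
    return mask


def find_delims(buf: bytes) -> list:
    """Positions of all field and row separators in buf, the output the
    rest of the CSV parser consumes."""
    return [i for i, b in enumerate(buf) if b in (0x2C, 0x0A)]
```

Replacing the per-byte branch with one wide compare plus a mask extraction is what moves throughput from memory-bandwidth-adjacent numbers like 18 GB/s toward 21 GB/s.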
CMU TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks(arxiv.org) To measure LLM agents' progress on real-world professional tasks, this paper introduces TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to a digital worker: by browsing the web, writing code, running programs, and communicating with coworkers.
What went into training DeepSeek-R1?(epoch.ai) On January 20th, 2025, DeepSeek released their latest open-weights reasoning model, DeepSeek-R1, which is on par with OpenAI’s o1 in benchmark performance.
Nvidia's new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size(venturebeat.com) Even as Meta fends off questions and criticisms of its new Llama 4 model family, GPU giant Nvidia has released a new, fully open-source large language model (LLM) based on Meta's older Llama-3.1-405B-Instruct. Nvidia claims near-top performance on a variety of third-party benchmarks, outperforming the vaunted open-source reasoning model DeepSeek R1.
11 points by weavedfreedunes 45 days ago | 2 comments
Serving Vector Tiles, Fast(spatialists.ch) Want to serve #VectorTiles to your users? Fabian Rechsteiner’s benchmark pits six open-source servers (#BBOX, #ldproxy, #Martin, #pg_tileserv, #Tegola, #TiPg) against each other, revealing stark speed differences.
103 points by altilunium 47 days ago | 17 comments
LocalScore: A Local LLM Benchmark(localscore.ai) Today, I'm excited to announce LocalScore – an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results.
9 points by jborichevskiy 49 days ago | 2 comments
AMD Ryzen 9 9900X3D Impact of the 3D V-Cache Optimizer Linux Driver Review(phoronix.com) Last month I posted benchmarks showing the performance when using the new 3D V-Cache Optimizer driver on Linux with the flagship Ryzen 9 9950X3D. This optimizer driver lets you tune "amd_x3d_mode" to indicate whether you prefer the CCD with the higher frequency or the one with the larger cache. For some additional insight into the 3D V-Cache Optimizer driver's performance impact on Linux, here are benchmarks looking at the difference while using the AMD Ryzen 9 9900X3D.
Show HN: LocalScore – Local LLM Benchmark(localscore.ai) There are two ways to run LocalScore. The easiest way to get started is to download one of the Official Models. If you already have .gguf models, you can run LocalScore with them.