Hacker News with Generative AI: Performance

DeepSeek-R1-Lite-Preview is live: o1-preview-level performance on AIME and MATH (twitter.com)
Why is Apple Rosetta 2 fast? (2022) (wordpress.com)
Rosetta 2 is remarkably fast when compared to other x86-on-ARM emulators.
Story-time: C++, bounds checking, performance, and compilers (chandlerc.blog)
Recently, several of my colleagues at Google shared the story of how we are retrofitting spatial safety onto our monolithic C++ codebase: https://security.googleblog.com/2024/11/retrofitting-spatial-safety-to-hundreds.html
SQLite vs. PostgreSQL Performance [video] (youtube.com)
Ubuntu Praises 5~7% PGO Compiler Optimization Performance Benefits (phoronix.com)
Over the past year we have seen Canonical engineers focus more on optimizing the performance potential of Ubuntu Linux.
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference (cerebras.ai)
Frontier AI now runs at instant speed. Last week we ran a customer workload on Llama 3.1 405B at 969 tokens/s – a new record for Meta’s frontier model. Llama 3.1 405B on Cerebras is by far the fastest frontier model in the world – 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet. In addition, we achieved the highest performance at 128K context length and shortest time-to-first-token latency, as measured by Artificial Analysis.
Linux 6.13 Quadrupling Workqueue Concurrency Limit (phoronix.com)
The Linux kernel Workqueue (WQ) is used for handling asynchronous process execution. For the past many years there has been an upper limit on the number of workqueue execution contexts per CPU at 512, but with Linux 6.13 that is being quadrupled to a limit of 2048.
The Fastest Redis Ever (redis.io)
We’re happy to announce the second milestone of Redis 8, our most advanced and performant offering yet, available for you to try in Community Edition (CE) today.
Thanks, Linus. Torvalds patch improves Linux performance by 2.6% (theregister.com)
A relatively tiny code change by penguin premier Linus Torvalds is making a measurable improvement to Linux's multithreaded performance.
CPython's Garbage Collector and Its Impact on Application Performance (codingconfessions.com)
Learn how knowledge of CPython internals translates into performance insights for your code
M4 chips: E and P cores (eclecticlight.co)
In the two previous articles (links at the end), I explored some of the features and properties of Performance (P) cores in Apple’s latest M4 chips. This article looks at their Efficiency (E) cores by comparison.
Nokolexbor: Drop-in replacement for Nokogiri. 5.2x faster at parsing HTML (github.com/serpapi)
Nokolexbor is a drop-in replacement for Nokogiri. It's 5.2x faster at parsing HTML and up to 997x faster at CSS selectors.
Optimizers: The Low-Key MVP (duckdb.org)
TL;DR: The query optimizer is an important part of any analytical database system as it provides considerable performance improvements compared to hand-optimized queries, even as the state of your data changes.
Dagger 0.14 (dagger.io)
Today we are introducing version 0.14 of the Dagger Engine. In this release we introduce a powerful new way to authenticate against your existing Git infrastructure; an API to better integrate with test tooling; better OCI interop; a more flexible networking model; performance improvements; CPU metrics; public traces; and more.
Shenandoah GC (openjdk.org)
Shenandoah is the low pause time garbage collector that reduces GC pause times by performing more garbage collection work concurrently with the running Java program.
First Impressions: Lenovo T14s with Qualcomm Snapdragon ARM64 CPU (freebsd.org)
Those of you who know me, know that I am not a big fan of the X86 architecture, which I think is a bad mess, mangled by market power considerations, rather than the CPU architecture this world actually needs, in particular in terms of performance/energy ratio.
We don't need to use what we make (sive.rs)
For many years, I was a touring musician, performing live on stage every week.
Wirth's Law (wikipedia.org)
Wirth's law is an adage on computer performance which states that software is getting slower more rapidly than hardware is becoming faster.
Uncached Buffered IO Is Performing Great, Working Now on Btrfs / EXT4 / XFS (phoronix.com)
As covered last week, Linux I/O expert Jens Axboe has been taking a fresh run at uncached buffered I/O for Linux.
Backblaze Rate Limiting Policy for Consistent Performance (backblaze.com)
Highways have lanes for a reason. The lanes help ensure that large volumes of traffic can reach their destinations quickly and safely. And they support order and predictability in systems where some folks want (or need) to go NASCAR fast and others like myself a little less so.
Windows Kills SMB Speeds When Using Tailscale (danthesalmon.com)
Yesterday when trying to transfer an ISO file from a TrueNAS SMB share, I was getting horrible transfer speeds.
Apple M4 Mac Mini with macOS vs. Intel / AMD with Ubuntu Linux Performance (phoronix.com)
Apple last week released their latest iMac, Mac Mini, and MacBook Pro products powered by their fourth-generation M-series Apple Silicon. The new Mac Mini is particularly interesting: it starts at under $600, is fully redesigned around a 10-core M4, and the base model now has 16GB of memory.
Micron launches 60TB PCIe gen5 SSD with 12GB/s read speeds (micron.com)
The Micron 6550 ION SSD is the world’s first 60TB PCIe Gen5 data center SSD, built to deliver unparalleled performance, energy efficiency and density.
M4 Mac mini's efficiency (jeffgeerling.com)
I had to pause some of my work getting a current-gen AMD graphics card running on the Pi 5 and testing a 192-core AmpereOne server to quickly post on the M4's efficiency.
Intel Spots 3888.9% Performance Improvement in Linux Kernel from 1 Line of Code (phoronix.com)
Intel's Linux kernel test robot has reported a 3888.9% performance improvement in the mainline Linux kernel as of this past week.
UserBenchmark suggests you buy the i5-13600K over the Ryzen 7 9800X3D (tomshardware.com)
Inside M4 chips: P cores (eclecticlight.co)
This is the first in a series diving deeper into Apple’s new M4 family of chips. This starts with details of its Performance (P) cores. Comparisons of their performance against cores in earlier M-series chips will follow separately when I have completed them.
Assembly Optimization Tips by Mark Larson (2004) (masm32.com)
The most important thing to remember is to TIME your code. Trying different tricks might or might not speed up your code. So it is very important to time your code to see if you do get a speedup as you try each trick.
High Power Mode for M4 Pro Macs (mjtsai.com)
Apple has added a High Power performance mode for the M4 Pro both in the Mac mini and in the new MacBook Pros.
Rust vs. C vs. Go runtime speed comparison (2023) (rust-lang.org)