Hacker News with Generative AI: GPU

GPU-Driven Clustered Forward Renderer (logdahl.net)
CUDA version of GROMACS is faster on AMD than HIP port (scale-lang.com)
With the release of version 1.3.1, SCALE has reached a major compatibility milestone: the ability to run the CUDA version of GROMACS on AMD GPUs.
Arm's Bifrost Architecture and the Mali-G52 (chipsandcheese.com)
Arm (the company) is best known for its Cortex CPU line. But Arm today has expanded to offer a variety of licensable IP blocks, ranging from interconnects to IOMMUs to GPUs.
Linear Programming for Fun and Profit (modal.com)
If you haven’t noticed, the GPU market is highly volatile. NVIDIA repeatedly spews out new chip architectures, doubling FLOPS every few years. Everyone shifts towards the newest cards, causing temporary supply crunches and high prices. But Modal’s customers don’t want to think about these price fluctuations. They want GPUs of all kinds at predictable and good prices, and the ability to demand thousands of GPUs on a moment’s notice, without having to worry about pricing, capacity planning, or supply.
Doom GPU Flame Graphs (brendangregg.com)
AI Flame Graphs are now open source and include Intel Battlemage GPU support, which means they can also generate full-stack GPU flame graphs that provide new insights into gaming performance, especially when coupled with FlameScope (an older open source project of mine). Here's an example of GZDoom, and I'll start with flame scopes for both CPU and GPU utilization, with details annotated:
GPU Price Tracker (unitedcompute.ai)
Track current prices, specifications, and historical trends for the most popular GPUs
EGPU: Extending eBPF Programmability and Observability to GPUs (aptaracorp.com)
Precise GPU observability and programmability are essential for optimizing performance in AI workloads and other computationally intensive high-performance computing (HPC) applications.
GPU Server with 8 RTX 4090 (a16z.com)
In today’s AI-driven world, the ability to train AI models locally and perform fast inference on GPUs at an optimal cost is more important than ever.
The Asus Ascent GX10 a Nvidia GB10 Mini PC with 128GB of Memory and 200GbE (servethehome.com)
NVIDIA’s platform, previously codenamed Project DIGITS, is a hit at GTC 2025. Apparently, big customers are asking if they can get a DGX Spark thrown in with large GPU purchases. The reason is simple: this is a mini PC form factor that packs a co-packaged Arm CPU and Blackwell GPU, 128GB of shared LPDDR5x memory, multiple USB4 ports, and even a ConnectX-7 NIC for 200GbE clustering.
AMD Radeon RX 9070 Series Linux GPU Compute Performance (phoronix.com)
In addition to the Radeon RX 9070 series Linux gaming/graphics benchmarks with today's embargo lift, I've also spent some time working on some GPU compute benchmarks for these first RDNA4 graphics cards.
DeepSeek open source DeepEP – library for MoE training and Inference (github.com/deepseek-ai)
DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput and low-latency all-to-all GPU kernels, which are also known as MoE dispatch and combine. The library also supports low-precision operations, including FP8.
DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs (github.com/deepseek-ai)
FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving.
The Ultra-Scale Playbook: Training LLMs on GPU Clusters (huggingface.co)
Ask HN: Confused about how DeepSeek hurts Nvidia (ycombinator.com)
I’m genuinely confused about why people think DeepSeek’s results will mean fewer GPUs being needed in the future.
Bilinear down/upsampling, aligning pixel grids, and that infamous GPU half pixel (2021) (bartwronski.com)
See this ugly pixel shift when upsampling a downsampled image? My post describes where it can come from and how to avoid those!
Show HN: A GPU-accelerated MD5 Hash Cracker, Written Using Rust and CUDA (github.com/vaktibabat)
MD5 hash cracking with CUDA and Rust, implemented from scratch
Chinese GPU designers received key technologies from British company Imagination (tomshardware.com)
Show HN: Svader – Create GPU-rendered Svelte components (github.com/sockmaster27)
Create GPU-rendered Svelte components with WebGL and WebGPU fragment shaders.
GPU Glossary (modal.com)
We wrote this glossary to solve a problem we ran into working with GPUs here at Modal: the documentation is fragmented, making it difficult to connect concepts at different levels of the stack, like Streaming Multiprocessor Architecture, Compute Capability, and nvcc compiler flags.
Exploring inference memory saturation effect: H100 vs. MI300x (dstack.ai)
GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory saturation’s impact on inference using NVIDIA's H100 and AMD's MI300x with Llama 3.1 405B FP8.
Compilation on the GPU? A Feasibility Study (2022) (dl.acm.org)
The emergence of highly parallel architectures has led to a renewed interest in parallel compilation.
Scale (run CUDA on AMD GPUs without mods) supports gfx900 and gfx1102 (scale-lang.com)
Optimizing a Rust GPU matmul kernel (rust-gpu.github.io)
I read the excellent post Optimizing a WebGPU Matmul Kernel for 1TFLOP+ Performance by Zach Nussbaum and thought it might be fun to reimplement it with Rust GPU.
$2 H100s: How the GPU Rental Bubble Burst (latent.space)
H100s used to be $8/hr if you could get them. Now there's 7 different places sometimes selling them under $2. What happened?
Scuda – Virtual GPU over IP (github.com/kevmo314)
SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
Show HN: Squey, an open-source GPU-accelerated data visualization software (squey.org)
Squey 5.0 is out! Check out the new Parquet plugin and the revamped UI.
炊紙 (kashikishi) is a text editor that uses the GPU to edit text in a 3D space (github.com/mitoma)
炊紙 is a text editor that lets you edit text in three-dimensional space. It is pronounced "kashikishi".