Hacker News with Generative AI: CUDA

Ask HN: Why hasn’t AMD made a viable CUDA alternative? (ycombinator.com)
I appreciate that developing ROCm into something competitive with CUDA would require a lot of work, both internally at AMD and through external contributions to the relevant open-source libraries.
Happy 18th Birthday CUDA (thechipletter.substack.com)
CuTile: New CUDA Alternative from Nvidia (twitter.com)
Parallel Histogram Computation with CUDA (khushi-411.github.io)
This blog post introduces the parallel histogram pattern, in which any thread may update any output element, so threads must coordinate their updates. We first look at using atomic operations to serialize updates to each element, then study an optimization technique: privatization. Let's dig in!
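The two techniques the post names can be sketched in one kernel. This is a minimal illustration (my own, not code from the post): each block keeps a privatized 256-bin histogram in shared memory, threads update it with `atomicAdd`, and the block merges its private copy into the global histogram with one atomic per bin at the end.

```cuda
// Privatized histogram: per-block shared-memory bins, merged into
// global memory at the end. Assumes 8-bit input values (256 bins).
__global__ void histogram_private(const unsigned char *data, int n,
                                  unsigned int *bins) {
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;                       // zero the private bins
    __syncthreads();

    // Grid-stride loop: contention is confined to this block's copy.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block, instead of one per element.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);
}
```

Privatization helps because shared-memory atomics within a block are much cheaper than contended global-memory atomics across the whole grid.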
Sorting algorithms with CUDA (ashwanirathee.com)
Building on my previous post on sorting algorithms, I implemented the same algorithms using CUDA to explore performance improvements through parallel computing.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (github.com/deepseek-ai)
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library requires no compilation at install time: all kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module.
AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition (sakana.ai)
At Sakana AI, we believe the path to much stronger AI systems is to automate AI development using AI itself. We aim to build AI systems that can create even more capable and efficient AI systems.
CUDA is the incumbent, but is it any good? (modular.com)
Answering the question of whether CUDA is “good” is much trickier than it sounds. Are we talking about its raw performance? Its feature set? Perhaps its broader implications in the world of AI development? Whether CUDA is “good” depends on who you ask and what they need. In this post, we’ll evaluate CUDA from the perspective of the people who use it day-in and day-out—those who work in the GenAI ecosystem:
Introduction to CUDA programming for Python developers (pyspur.dev)
Rust-CUDA Project Restarted for Enabling Nvidia CUDA Kernels Within Rust Code (phoronix.com)
The open-source Rust CUDA project has been "rebooted", resuming the effort to allow NVIDIA CUDA compute kernels to be written in the Rust programming language.
DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses PTX (tomshardware.com)
Show HN: HipScript – Run CUDA in the browser with WebAssembly and WebGPU (lights0123.com)
Online compiler for HIP and NVIDIA® CUDA® code to WebGPU
Show HN: Lightweight Llama3 Inference Engine – CUDA C (github.com/abhisheknair10)
Llama3.cu is a CUDA-native implementation of the LLaMA 3 architecture for causal language modeling.
Show HN: A GPU-accelerated MD5 Hash Cracker, Written Using Rust and CUDA (github.com/vaktibabat)
MD5 hash cracking with CUDA and Rust, implemented from scratch
Show HN: Cudair – live-reloading for developing CUDA applications (github.com/ei-sugimoto)
cudair enables live-reloading for developing CUDA applications, similar to golang-air. I recommend using Docker.
Train a Mnist VAE with C and CUDA (github.com/ggerganov)
Hi, I just want to share what I have been working on recently: an example of training an MNIST VAE. The goal is to use only the ggml pipeline and its implementation of the Adam optimizer.
Fast LLM Inference From Scratch (using CUDA) (andrewkchan.dev)
This post is about building an LLM inference engine using C++ and CUDA from scratch without libraries.
Check if your performance intuition still works with CUDA (wordsandbuttons.online)
For those of you who don't know what CUDA is, let me explain. Imagine buses were never invented. There are cars, trains, planes, and motorcycles, just not buses. And one day someone smart asks himself: “Wouldn't it be splendid to have cars that would fit a lot of people? One guy could be driving, and all the rest will enjoy the ride.” “Right, like trucks but for people!” “No-no-no, who on earth would ever want to travel by truck?”
CUDA Programming Course – High-Performance Computing with GPUs [video] (youtube.com)
John Nickolls "ultimately willed CUDA into existence" (twitter.com)
Initial CUDA Performance Lessons (probablydance.com)
I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you’ll write suboptimal code, so here are some lessons I had to learn to get things to run fast.
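One lesson that trips up newcomers carrying over CPU habits is how to index work across threads. A common idiom (my own sketch, not taken from the post) is the grid-stride loop: it handles inputs of any size with a fixed launch configuration, and it keeps consecutive threads touching consecutive elements, which the hardware coalesces into efficient memory transactions.

```cuda
// Grid-stride SAXPY: y = a*x + y. Works for any n, regardless of how
// many blocks/threads were launched.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];  // adjacent threads read adjacent elements
}
```

The C++-habit version, where each thread loops over a contiguous chunk, gives every warp strided (uncoalesced) accesses and runs markedly slower.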
Zen, CUDA, and Tensor Cores – Part 1 [video] (youtube.com)
Zen, CUDA, and Tensor Cores, Part I: The Silicon (computerenhance.com)
Gemlite: Towards Building Custom Low-Bit Fused CUDA Kernels (mobiusml.github.io)
LibreCUDA – Launch CUDA code on Nvidia GPUs without the proprietary runtime (github.com/mikex86)
Open-Source AMD GPU Implementation of CUDA "Zluda" Has Been Taken Down (phoronix.com)
How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022) (siboehm.com)
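Posts like this one typically start from a naive baseline and optimize from there. For reference, here is a sketch of that baseline (my own illustration, not siboehm's code): one thread per output element of C = A·B for row-major N×N matrices, with every thread re-reading entire rows and columns from global memory — which is exactly the waste that tiling into shared memory later removes.

```cuda
// Naive matmul: each thread computes one element of C. No tiling,
// no shared memory; purely a correctness baseline.
__global__ void matmul_naive(int N, const float *A, const float *B,
                             float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```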
Run CUDA, unmodified, on AMD GPUs (scale-lang.com)
Show HN: UNet diffusion model in pure CUDA (github.com/clu0)
The One Billion Row Challenge in CUDA (tspeterkim.github.io)