Hacker News with Generative AI: CUDA

Ask HN: Why hasn’t AMD made a viable CUDA alternative? (ycombinator.com)
I appreciate that developing ROCm into something competitive with CUDA would require a lot of work, both internally at AMD and through external contributions to the relevant open-source libraries.
Happy 18th Birthday CUDA (thechipletter.substack.com)
CuTile: New CUDA Alternative from Nvidia (twitter.com)
Parallel Histogram Computation with CUDA (khushi-411.github.io)
This blog post introduces the parallel histogram pattern, in which any thread may update any output element, so threads must coordinate their updates. We first look at using atomic operations to serialize updates to each element, then study an optimization technique: privatization. Let's dig in!
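The two techniques the post names can be sketched in one kernel. This is a minimal illustration (my own, not code from the post): each block keeps a privatized 256-bin histogram in shared memory, threads update it with `atomicAdd`, and the block merges its private copy into the global histogram with one atomic per bin at the end.

```cuda
// Privatized histogram: per-block shared-memory bins, merged into
// global memory at the end. Assumes 8-bit input values (256 bins).
__global__ void histogram_private(const unsigned char *data, int n,
                                  unsigned int *bins) {
    __shared__ unsigned int local[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        local[i] = 0;                       // zero the private bins
    __syncthreads();

    // Grid-stride loop: contention is confined to this block's copy.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block, instead of one per element.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);
}
```

Privatization helps because shared-memory atomics within a block are much cheaper than contended global-memory atomics across the whole grid.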
Sorting algorithms with CUDA (ashwanirathee.com)
Building on my previous post on sorting algorithms, I implemented the same algorithms using CUDA to explore performance improvements through parallel computing.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (github.com/deepseek-ai)
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library requires no compilation at install time: all kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module.
AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition (sakana.ai)
At Sakana AI, we believe the path to much stronger AI systems is to automate AI development using AI itself. We aim to build AI systems that can create even more capable and efficient AI systems.
CUDA is the incumbent, but is it any good? (modular.com)
Answering the question of whether CUDA is “good” is much trickier than it sounds. Are we talking about its raw performance? Its feature set? Perhaps its broader implications in the world of AI development? Whether CUDA is “good” depends on who you ask and what they need. In this post, we’ll evaluate CUDA from the perspective of the people who use it day-in and day-out—those who work in the GenAI ecosystem:
Introduction to CUDA programming for Python developers (pyspur.dev)
Rust-CUDA Project Restarted for Enabling Nvidia CUDA Kernels Within Rust Code (phoronix.com)
The open-source Rust CUDA project has been "rebooted", resuming the effort to allow NVIDIA CUDA compute kernels to be written in the Rust programming language.
DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses PTX (tomshardware.com)
Show HN: HipScript – Run CUDA in the browser with WebAssembly and WebGPU (lights0123.com)
Online compiler for HIP and NVIDIA® CUDA® code to WebGPU
Show HN: Lightweight Llama3 Inference Engine – CUDA C (github.com/abhisheknair10)
Llama3.cu is a CUDA-native implementation of the LLaMA 3 architecture for causal language modeling.
Show HN: A GPU-accelerated MD5 Hash Cracker, Written Using Rust and CUDA (github.com/vaktibabat)
MD5 hash cracking with CUDA and Rust, implemented from scratch
Show HN: Cudair – live-reloading for developing CUDA applications (github.com/ei-sugimoto)
cudair enables live-reloading for developing CUDA applications, similar to golang-air. I recommend using Docker.
Train a Mnist VAE with C and CUDA (github.com/ggerganov)
Hi, I just want to share what I have been working on recently: an example of training an MNIST VAE. The goal is to use only the ggml pipeline and its implementation of the Adam optimizer.
Fast LLM Inference From Scratch (using CUDA) (andrewkchan.dev)
This post is about building an LLM inference engine using C++ and CUDA from scratch without libraries.
Check if your performance intuition still works with CUDA (wordsandbuttons.online)
For those of you who don't know what CUDA is, let me explain. Imagine buses were never invented. There are cars, trains, planes, and motorcycles, just not buses. And one day someone smart asks himself: “Wouldn't it be splendid to have cars that would fit a lot of people? One guy could be driving, and all the rest will enjoy the ride.” “Right, like trucks but for people!” “No-no-no, who on earth would ever want to travel by truck?”
CUDA Programming Course – High-Performance Computing with GPUs [video] (youtube.com)
John Nickolls "ultimately willed CUDA into existence" (twitter.com)
Initial CUDA Performance Lessons (probablydance.com)
I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you’ll write suboptimal code, so here are some lessons I had to learn to get things to run fast.
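One lesson that trips up newcomers carrying over CPU habits is how to index work across threads. A common idiom (my own sketch, not taken from the post) is the grid-stride loop: it handles inputs of any size with a fixed launch configuration, and it keeps consecutive threads touching consecutive elements, which the hardware coalesces into efficient memory transactions.

```cuda
// Grid-stride SAXPY: y = a*x + y. Works for any n, regardless of how
// many blocks/threads were launched.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];  // adjacent threads read adjacent elements
}
```

The C++-habit version, where each thread loops over a contiguous chunk, gives every warp strided (uncoalesced) accesses and runs markedly slower.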
Zen, CUDA, and Tensor Cores – Part 1 [video] (youtube.com)
Zen, CUDA, and Tensor Cores, Part I: The Silicon (computerenhance.com)
Gemlite: Towards Building Custom Low-Bit Fused CUDA Kernels (mobiusml.github.io)
LibreCUDA – Launch CUDA code on Nvidia GPUs without the proprietary runtime (github.com/mikex86)
Open-Source AMD GPU Implementation of CUDA "Zluda" Has Been Taken Down (phoronix.com)
How to optimize a CUDA matmul kernel for cuBLAS-like performance (2022) (siboehm.com)
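Posts like this one typically start from a naive baseline and optimize from there. For reference, here is a sketch of that baseline (my own illustration, not siboehm's code): one thread per output element of C = A·B for row-major N×N matrices, with every thread re-reading entire rows and columns from global memory — which is exactly the waste that tiling into shared memory later removes.

```cuda
// Naive matmul: each thread computes one element of C. No tiling,
// no shared memory; purely a correctness baseline.
__global__ void matmul_naive(int N, const float *A, const float *B,
                             float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```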
Run CUDA, unmodified, on AMD GPUs (scale-lang.com)
Show HN: UNet diffusion model in pure CUDA (github.com/clu0)
The One Billion Row Challenge in CUDA (tspeterkim.github.io)