Hacker News with Generative AI: GPU Optimization

Optimizing Matrix Multiplication on RDNA3 (seb-v.github.io)
In this post, I will share with you all the steps to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU, outperforming rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done iteratively across 8 different kernels.
Sorting for Rendering (linebender.org)
Many rendering algorithms (including a proposed sparse strip technique for path rendering, and also Gaussian Splatting) rely on sorting. Because the GPU has a different architecture to the CPU, programs running on the GPU have different performance characteristics, and this changes which sorting algorithms are optimal for a particular context. In particular, sorting algorithms that exploit parallelism tend to be more suited to the GPU.
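As a rough illustration of what "exploiting parallelism" can mean for sorting, here is a minimal CPU-side sketch of a bitonic sorting network in Python. It is not taken from the linked post, and the function name `bitonic_sort` is made up for this example. It assumes the input length is a power of two. In each pass of the innermost loop, every compare-and-swap touches a disjoint pair of elements, which is why a pass like this could map to one GPU thread per element.

```python
def bitonic_sort(values):
    """Sort a sequence whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2  # size of the bitonic sequences being merged in this stage
    while k <= n:
        j = k // 2  # distance between the two elements of each compared pair
        while j >= 1:
            # Every iteration of this loop is independent of the others:
            # on a GPU, each index i could be handled by its own thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

The network performs more comparisons than a typical CPU sort, but its fixed, data-independent access pattern is what makes it easy to run in parallel. Calling `bitonic_sort([3, 1, 4, 1, 5, 9, 2, 6])` returns the list in ascending order.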