Hacker News with Generative AI: GPU Optimization

An Almost Pointless Exercise in GPU Optimization (speechmatics.com)
Not everyone is able to write funky fused operators to make ML models run faster on GPUs using clever quantisation tricks. However, lots of developers work with algorithms that feel like they should be able to leverage the thousands of cores in a GPU to run faster than they do on the dozens of cores of a server CPU. To see what is possible and what is involved, I revisited the first problem I ever considered trying to accelerate with a GPU.
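For a sense of what "leveraging thousands of cores" means in practice, a trivially parallel loop becomes a kernel where each element gets its own GPU thread. This is a generic sketch, not code from the article; the kernel name and sizes are made up for illustration.

    // One GPU thread per element, instead of one CPU core per chunk of the loop.
    __global__ void scale_add(const float* x, float* y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    // Launch enough 256-thread blocks to cover n elements, e.g.:
    //   scale_add<<<(n + 255) / 256, 256>>>(d_x, d_y, 2.0f, n);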
A handy metric is needed for gauging if GPUs are being used optimally (theregister.com)
GPU accelerators used in AI processing are costly items, so making sure you get the best usage out of them ought to be a priority, yet the industry lacks an effective way of measuring this, says the Uptime Institute.
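For reference, the number most people quote today is the coarse "GPU utilization" that nvidia-smi exposes via NVML: it only reports the fraction of time some kernel was running, not how much of the chip that kernel kept busy. A minimal query sketch (not from the article; link against -lnvidia-ml):

    #include <cstdio>
    #include <nvml.h>

    int main() {
        // Coarse utilization query -- roughly what nvidia-smi reports.
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        if (nvmlInit() != NVML_SUCCESS) return 1;
        nvmlDeviceGetHandleByIndex(0, &dev);
        nvmlDeviceGetUtilizationRates(dev, &util);
        printf("gpu busy: %u%%  memory busy: %u%%\n", util.gpu, util.memory);
        nvmlShutdown();
        return 0;
    }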
Mipmap selection in too much detail (pema.dev)
In this post, I want to shed some light on something I’ve been wondering about for a while: How exactly are mipmap levels selected when sampling textures on the GPU?
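As background, the reference formula in the GL/Vulkan specs picks the level from the screen-space derivatives of the texture coordinates, measured in texels; real GPUs approximate those derivatives with finite differences across 2x2 pixel quads. A simplified sketch of the reference computation (parameter names are mine, and the min/max LOD clamping from the spec is omitted):

    #include <cmath>
    #include <algorithm>

    // lambda = log2(rho), where rho is the larger of the two screen-space
    // texture-coordinate derivative lengths, scaled into texel units.
    float mip_level(float dudx, float dvdx, float dudy, float dvdy,
                    float tex_w, float tex_h) {
        float rho_x = std::sqrt(dudx * tex_w * dudx * tex_w + dvdx * tex_h * dvdx * tex_h);
        float rho_y = std::sqrt(dudy * tex_w * dudy * tex_w + dvdy * tex_h * dvdy * tex_h);
        float rho = std::max(rho_x, rho_y);
        return std::log2(std::max(rho, 1.0f)); // clamp so magnification maps to level 0
    }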
Optimizing Matrix Multiplication on RDNA3 (seb-v.github.io)
In this post, I will share with you all the steps to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU, outperforming rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This is done in an iterative way across 8 different kernels.
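For context, the usual starting point for a series like this is a naive one-thread-per-output kernel, before any tiling, shared-memory use, or vectorization. The sketch below uses CUDA syntax; the post targets HIP on RDNA3, where the kernel body would be essentially the same, and its optimized kernels go far beyond this baseline.

    // Naive FP32 matmul: C = A * B, row-major, square N x N matrices.
    __global__ void sgemm_naive(const float* A, const float* B, float* C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }
    // Launch (hypothetical sizes):
    //   dim3 block(16, 16);
    //   dim3 grid((N + 15) / 16, (N + 15) / 16);
    //   sgemm_naive<<<grid, block>>>(dA, dB, dC, N);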
Sorting for Rendering (linebender.org)
Many rendering algorithms (including a proposed sparse strip technique for path rendering, and also Gaussian Splatting) rely on sorting. Because the GPU has a different architecture to the CPU, programs running on the GPU have different performance characteristics, and this changes which sorting algorithms are optimal for a particular context. In particular, sorting algorithms that exploit parallelism tend to be more suited to the GPU.
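As an illustration of a parallelism-friendly sort, here is a minimal single-block bitonic sort sketch; it is not taken from the linebender work, and production renderers typically use multi-pass radix or merge sorts that span the whole GPU rather than a single block.

    // Single-block bitonic sort: n must be a power of two and n <= blockDim.x.
    // Keys are sorted ascending via compare-exchanges in shared memory.
    __global__ void bitonic_sort_block(unsigned int* data, int n) {
        extern __shared__ unsigned int s[];
        int tid = threadIdx.x;
        if (tid < n) s[tid] = data[tid];
        __syncthreads();
        for (int k = 2; k <= n; k <<= 1) {          // size of the bitonic sequences
            for (int j = k >> 1; j > 0; j >>= 1) {  // compare-exchange distance
                int partner = tid ^ j;
                if (tid < n && partner > tid) {
                    bool ascending = ((tid & k) == 0);
                    if (ascending ? (s[tid] > s[partner]) : (s[tid] < s[partner])) {
                        unsigned int t = s[tid];
                        s[tid] = s[partner];
                        s[partner] = t;
                    }
                }
                __syncthreads();
            }
        }
        if (tid < n) data[tid] = s[tid];
    }
    // Launch (hypothetical, 1024 keys):
    //   bitonic_sort_block<<<1, 1024, 1024 * sizeof(unsigned int)>>>(d_keys, 1024);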