Hacker News with Generative AI: Tensor Cores

How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024) (alexarmbr.github.io)
This post details my recent efforts to write an optimized matrix multiplication kernel in CUDA using tensor cores on an NVIDIA Tesla T4 GPU. The goal is to compute $D = \alpha * A * B + \beta * C$ as fast as possible. In this equation, $D$, $A$, $B$, and $C$ are large matrices of half-precision floating-point numbers, and $\alpha$ and $\beta$ are constants. This problem is usually referred to as a Half-precision Generalized Matrix Multiply, or HGEMM for short.
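For orientation, the operation $D = \alpha * A * B + \beta * C$ can be sketched as a plain CPU reference loop. This is not the kernel from the linked post, only a minimal illustration of what such a kernel has to compute. The function name `gemm_reference`, the row-major layout, and the use of `float` in place of half precision are all assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Naive reference for D = alpha * A * B + beta * C.
// Assumptions (not from the linked post): row-major storage, and float
// standing in for half precision so the sketch compiles as plain C++.
// A is M x K, B is K x N, C and the returned D are M x N.
std::vector<float> gemm_reference(std::size_t M, std::size_t N, std::size_t K,
                                  float alpha,
                                  const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  float beta,
                                  const std::vector<float>& C) {
    std::vector<float> D(M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            // Scale the product and add the scaled C term.
            D[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
    return D;
}
```

A loop like this is typically useful only as a correctness check for small inputs; an optimized GPU kernel would be compared against it rather than resemble it.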
Nvidia Tensor Core Programming (leimao.github.io)
NVIDIA Tensor Cores are dedicated accelerators for general matrix multiplication (GEMM) operations, available on NVIDIA GPUs since the Volta architecture.
Zen, CUDA, and Tensor Cores – Part 1 [video] (youtube.com)