Hacker News with Generative AI: Matrix Multiplication

How to Write a Fast Matrix Multiplication from Scratch with Tensor Cores (2024) (alexarmbr.github.io)
This post details my recent efforts to write an optimized matrix multiplication kernel in CUDA using tensor cores on an NVIDIA Tesla T4 GPU. The goal is to compute $D = \alpha * A * B + \beta * C$ as fast as possible. In this equation $D$, $A$, $B$, and $C$ are large matrices full of half-precision floating-point numbers, and $\alpha$ and $\beta$ are constants. This problem is usually referred to as a Half-precision Generalized Matrix Multiply, or HGEMM for short.
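For orientation, the formula $D = \alpha * A * B + \beta * C$ can be written as a naive triple loop. The sketch below is only a reference for what the operation computes, not the linked post's CUDA kernel: the function name `gemm` is hypothetical, matrices are assumed to be lists of rows, and Python floats are double precision rather than half precision.

```python
def gemm(alpha, a, b, beta, c):
    """Return D = alpha * (A @ B) + beta * C for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    # A must be m x k, B must be k x n, and C must be m x n.
    if any(len(row) != k for row in a) or any(len(row) != n for row in b):
        raise ValueError("shape mismatch between A and B")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("shape mismatch for C")
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # Dot product of row i of A with column j of B.
            for p in range(k):
                acc += a[i][p] * b[p][j]
            d[i][j] = alpha * acc + beta * c[i][j]
    return d


result = gemm(2.0, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 0.5, [[2, 2], [2, 2]])
print(result)
```

An optimized kernel computes the same result; the work lies in how the loops are tiled and mapped onto the hardware.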
Optimizing Matrix Multiplication (coffeebeforearch.github.io)
Matrix multiplication is an incredibly common operation across numerous domains. It is also known to be “embarrassingly parallel”. As such, one common optimization is parallelization across threads on a multi-core CPU or GPU. However, parallelization is not a panacea. Poorly parallelized code may provide minimal speedups (if any).
Optimizing Matrix Multiplication on RDNA3 (seb-v.github.io)
In this post, I will share with you all the steps to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done iteratively across 8 different kernels.
Karatsuba Matrix Multiplication and Its Efficient Hardware Implementations (arxiv.org)
While the Karatsuba algorithm reduces the complexity of large integer multiplication, the extra additions required minimize its benefits for smaller integers of more commonly-used bitwidths.
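As background for the trade-off mentioned above, here is a minimal sketch of the classic scalar Karatsuba recursion for integers. It is not the paper's matrix or hardware formulation; the function name `karatsuba` and the `cutoff_bits` parameter are illustrative assumptions. It replaces four half-size multiplications with three, at the cost of the extra additions and subtractions visible in the `mid` term.

```python
def karatsuba(x, y, cutoff_bits=32):
    """Multiply two non-negative integers using Karatsuba's three-product recursion."""
    if x < 0 or y < 0:
        raise ValueError("operands must be non-negative")
    if cutoff_bits < 4:
        raise ValueError("cutoff_bits must be at least 4")
    n = max(x.bit_length(), y.bit_length())
    # Small operands: fall back to the built-in multiply.
    if n <= cutoff_bits:
        return x * y
    half = n // 2
    mask = (1 << half) - 1
    # Split each operand into high and low halves: x = x_hi * 2**half + x_lo.
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    hi = karatsuba(x_hi, y_hi, cutoff_bits)
    lo = karatsuba(x_lo, y_lo, cutoff_bits)
    # One multiplication plus extra additions replaces two cross products.
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, cutoff_bits) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo


a = 12345678901234567890
b = 98765432109876543210
print(karatsuba(a, b, cutoff_bits=8) == a * b)
```

The `cutoff_bits` fallback reflects the point in the excerpt: below some operand width, the additions cost more than the multiplication they save.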
Experiments with Byte Matrix Multiplication (github.com/serge-sans-paille)
It's quite common in machine learning operations to multiply a matrix of unsigned bytes by a matrix of signed bytes.
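A minimal sketch of that operation, assuming unsigned 8-bit values on the left, signed 8-bit values on the right, and a wider accumulator for the sums. The function name `matmul_u8_s8` is hypothetical and this is not the linked repository's implementation. Python integers never overflow, so the comments note where a fixed-width language would need a wider accumulator type.

```python
def matmul_u8_s8(a, b):
    """Multiply an unsigned-byte matrix by a signed-byte matrix (lists of rows)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or any(len(row) != n for row in b):
        raise ValueError("shape mismatch")
    if any(not 0 <= v <= 255 for row in a for v in row):
        raise ValueError("left operand must hold unsigned bytes (0..255)")
    if any(not -128 <= v <= 127 for row in b for v in row):
        raise ValueError("right operand must hold signed bytes (-128..127)")
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # A single product lies in [-32640, 32385] and fits in 16 bits,
            # but the running sum can exceed that range, so a fixed-width
            # implementation would typically accumulate in a wider type.
            acc = 0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


print(matmul_u8_s8([[255, 255]], [[-128], [-128]]))
```

In this example the two products sum to -65280, which is already outside the signed 16-bit range.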
Matrix Multiplication in Finite Fields (fileforma.substack.com)
ffGEMM is a fixed-point arithmetic library for fast matrix multiplications on CPU. This article introduces the underlying mathematics for Fileforma’s ffGEMM library.
Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022) (siboehm.com)