Hacker News with Generative AI: Quantization

SVDQuant: 4-Bit Quantization Powers 12B Flux on a 16GB 4090 GPU with 3x Speedup (hanlab.mit.edu)
A new post-training quantization paradigm for diffusion models that quantizes both the weights and activations of FLUX.1 to 4 bits, achieving a 3.5× memory reduction and an 8.7× latency reduction on a 16GB laptop 4090 GPU.
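As a rough illustration of what mapping weights to 4-bit integers involves, here is a minimal sketch of symmetric per-channel 4-bit weight quantization. This is not SVDQuant's actual algorithm (which also quantizes activations and goes well beyond naive rounding); the function names and shapes are hypothetical.

```python
# Minimal sketch of symmetric per-channel 4-bit weight quantization.
# Illustrative only; NOT the SVDQuant method.
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Quantize each output channel (row) of w to signed 4-bit integers."""
    # One scale per row, chosen so the largest magnitude maps to 7.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)                  # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit range is [-8, 7]
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_4bit(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```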
VPTQ: Extreme low-bit Quantization for real LLMs (github.com/microsoft)
Vector Post-Training Quantization (VPTQ) is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2 bits). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.
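To see how sub-2-bit effective bit-widths are possible at all, here is a minimal vector-quantization sketch assuming a plain k-means codebook: each group of 8 weights is replaced by one 8-bit codebook index, costing 1 bit per weight plus the shared codebook. This is not the VPTQ algorithm itself; all names and parameters are hypothetical.

```python
# Naive vector quantization of a weight matrix with a k-means codebook.
# Illustrative only; NOT the VPTQ algorithm.
import numpy as np

def vq_compress(w, vec_len=8, codebook_size=256, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    vecs = w.reshape(-1, vec_len)                          # split weights into short vectors
    centroids = vecs[rng.choice(len(vecs), codebook_size, replace=False)]
    for _ in range(iters):                                 # plain k-means
        d = ((vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        for k in range(codebook_size):
            members = vecs[idx == k]
            if len(members):
                centroids[k] = members.mean(0)
    # Each group of vec_len weights is stored as one log2(codebook_size)-bit index.
    bits_per_weight = np.log2(codebook_size) / vec_len     # 8 / 8 = 1.0 bit here
    return idx.astype(np.uint8), centroids, bits_per_weight

w = np.random.randn(256, 64).astype(np.float32)
idx, codebook, bpw = vq_compress(w)
w_hat = codebook[idx].reshape(w.shape)
print(f"effective bits/weight: {bpw:.2f}, MSE: {np.mean((w - w_hat) ** 2):.4f}")
```

Longer vectors or smaller codebooks push the effective bit-width down further, at the cost of higher reconstruction error; the accuracy-preserving techniques are what methods like VPTQ contribute on top of this basic idea.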
EfficientQAT: LLM quantization that gets a 2-bit llama2-70B to outperform a regular 13B (reddit.com)
Towards Optimal LLM Quantization (picovoice.ai)
On-Device LLM Inference Powered by X-Bit Quantization (github.com/Picovoice)