Hacker News with Generative AI: Inference

Accelerated AI Inference via Dynamic Execution Methods (arxiv.org)
In this paper, we focus on Dynamic Execution techniques that optimize the computation flow based on input.
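One common flavor of input-dependent computation is early exit, where easy inputs stop after fewer layers. A minimal PyTorch sketch, assuming a list of decoder layers, a shared classifier head, and a confidence threshold (all illustrative, not necessarily the paper's specific methods):

```python
# Minimal sketch of one dynamic-execution idea: input-dependent early exit.
# The layer list, classifier head, and threshold are illustrative assumptions.
import torch

def early_exit_forward(layers, classifier, hidden, threshold=0.95):
    """Run decoder layers one at a time and stop as soon as the
    intermediate prediction for this input is confident enough."""
    logits = classifier(hidden[:, -1])          # prediction before any layer
    for layer in layers:
        hidden = layer(hidden)
        logits = classifier(hidden[:, -1])      # re-predict after each layer
        if torch.softmax(logits, dim=-1).max() >= threshold:
            break                               # confident enough: skip remaining layers
    return logits
```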
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference (cerebras.ai)
Frontier AI now runs at instant speed. Last week we ran a customer workload on Llama 3.1 405B at 969 tokens/s – a new record for Meta’s frontier model. Llama 3.1 405B on Cerebras is by far the fastest frontier model in the world – 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet. In addition, we achieved the highest performance at 128K context length and shortest time-to-first-token latency, as measured by Artificial Analysis.
GDDR7 Memory Supercharges AI Inference (semiengineering.com)
High bandwidth and low latency are paramount for AI-powered edge and endpoints.
Cerebras Inference now 3x faster: Llama 3.1-70B breaks 2,100 tokens/s (cerebras.ai)
Today we’re announcing the biggest update to Cerebras Inference since launch. Cerebras Inference now runs Llama 3.1-70B at an astounding 2,100 tokens per second – a 3x performance boost over the prior release.
AMD GPU Inference (github.com/slashml)
This project provides a Docker-based inference engine for running Large Language Models (LLMs) on AMD GPUs. It's designed to work with models from Hugging Face, with a focus on the LLaMA model family.
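A hedged sketch of what such a setup typically boils down to: loading a Hugging Face LLaMA-family checkpoint with Transformers on a ROCm build of PyTorch, where the AMD GPU is still addressed through the `cuda` device string. The model name and generation settings are assumptions for illustration, not the project's actual interface.

```python
# Illustrative only: generic Hugging Face inference on an AMD GPU (ROCm).
# ROCm builds of PyTorch expose the GPU via the "cuda" device string,
# so the standard Transformers code path applies unchanged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any LLaMA-family model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Inference on an AMD GPU:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```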
Cerebras Inference: AI at Instant Speed (cerebras.ai)
Cerebras launches inference for Llama 3.1; benchmarked at 1846 tokens/s on 8B (twitter.com)
Cerebras Launches the Fastest AI Inference (cerebras.ai)
Inference is free and instant (fume.substack.com)
Abstract representations emerge in human hippocampal neurons during inference (nature.com)
Nvidia NVLink and Nvidia NVSwitch Supercharge Large Language Model Inference (nvidia.com)
Groq Raises $640M to Meet Soaring Demand for Fast AI Inference (groq.com)
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI (bentoml.com)
AMD's MI300X Outperforms Nvidia's H100 for LLM Inference (tensorwave.com)
26× Faster Inference with Layer-Condensed KV Cache for Large Language Models (arxiv.org)
Practical Llama 3 inference implemented in a single Java file (github.com/mukel)
Effort – a possibly new algorithm for LLM Inference (kolinko.github.io)
Ampere Readies 256-Core CPU Beast, Awaits the AI Inference Wave (nextplatform.com)