Hacker News with Generative AI: Inference

Open source inference time compute example from HuggingFace (github.com/huggingface)
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
Show HN: NCompass Technologies – yet another AI Inference API, but hear us out (ncompass.tech)
At nCompass, we’re building AI inference serving software that can reduce the costs of serving AI models at scale by 50%.
Fast LLM Inference From Scratch (using CUDA) (andrewkchan.dev)
This post is about building an LLM inference engine using C++ and CUDA from scratch without libraries.
Exploring inference memory saturation effect: H100 vs. MI300x (dstack.ai)
GPU memory plays a critical role in LLM inference, affecting both performance and cost. This benchmark evaluates memory saturation’s impact on inference using NVIDIA’s H100 and AMD’s MI300x with Llama 3.1 405B FP8.
Accelerated AI Inference via Dynamic Execution Methods (arxiv.org)
In this paper, we focus on Dynamic Execution techniques that optimize the computation flow based on input.
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference (cerebras.ai)
Frontier AI now runs at instant speed. Last week we ran a customer workload on Llama 3.1 405B at 969 tokens/s – a new record for Meta’s frontier model. Llama 3.1 405B on Cerebras is by far the fastest frontier model in the world – 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet. In addition, we achieved the highest performance at 128K context length and shortest time-to-first-token latency, as measured by Artificial Analysis.
GDDR7 Memory Supercharges AI Inference (semiengineering.com)
High bandwidth and low latency are paramount for AI-powered edge and endpoints.
Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s (cerebras.ai)
Today we’re announcing the biggest update to Cerebras Inference since launch. Cerebras Inference now runs Llama 3.1-70B at an astounding 2,100 tokens per second – a 3x performance boost over the prior release.
AMD GPU Inference (github.com/slashml)
This project provides a Docker-based inference engine for running Large Language Models (LLMs) on AMD GPUs. It's designed to work with models from Hugging Face, with a focus on the LLaMA model family.
Cerebras Inference: AI at Instant Speed (cerebras.ai)
Cerebras launches inference for Llama 3.1; benchmarked at 1846 tokens/s on 8B (twitter.com)
Cerebras Launches the Fastest AI Inference (cerebras.ai)
Inference is free and instant (fume.substack.com)
Abstract representations emerge in human hippocampal neurons during inference (nature.com)
Nvidia NVLink and Nvidia NVSwitch Supercharge Large Language Model Inference (nvidia.com)
Groq Raises $640M to Meet Soaring Demand for Fast AI Inference (groq.com)
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI (bentoml.com)
AMD's MI300X Outperforms Nvidia's H100 for LLM Inference (tensorwave.com)
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TRT-LLM, and TGI (bentoml.com)
26× Faster Inference with Layer-Condensed KV Cache for Large Language Models (arxiv.org)
Practical Llama 3 inference implemented in a single Java file (github.com/mukel)
Effort – a possibly new algorithm for LLM Inference (kolinko.github.io)
Ampere Readies 256-Core CPU Beast, Awaits the AI Inference Wave (nextplatform.com)