Hacker News with Generative AI: Inference

Accelerated AI Inference via Dynamic Execution Methods (arxiv.org)
In this paper, we focus on Dynamic Execution techniques that optimize the computation flow based on input.
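One common flavor of input-dependent computation is early exit, where easy inputs stop after fewer layers. A minimal PyTorch sketch, assuming a list of decoder layers, a shared classifier head, and a confidence threshold (all illustrative, not necessarily the paper's specific methods):

```python
# Minimal sketch of one dynamic-execution idea: input-dependent early exit.
# The layer list, classifier head, and threshold are illustrative assumptions.
import torch

def early_exit_forward(layers, classifier, hidden, threshold=0.95):
    """Run decoder layers one at a time and stop as soon as the
    intermediate prediction for this input is confident enough."""
    logits = classifier(hidden[:, -1])          # prediction before any layer
    for layer in layers:
        hidden = layer(hidden)
        logits = classifier(hidden[:, -1])      # re-predict after each layer
        if torch.softmax(logits, dim=-1).max() >= threshold:
            break                               # confident enough: skip remaining layers
    return logits
```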
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference (cerebras.ai)
Frontier AI now runs at instant speed. Last week we ran a customer workload on Llama 3.1 405B at 969 tokens/s – a new record for Meta’s frontier model. Llama 3.1 405B on Cerebras is by far the fastest frontier model in the world – 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet. In addition, we achieved the highest performance at 128K context length and shortest time-to-first-token latency, as measured by Artificial Analysis.
GDDR7 Memory Supercharges AI Inference (semiengineering.com)
High bandwidth and low latency are paramount for AI-powered edge and endpoints.
Cerebras Inference now 3x faster: Llama 3.1-70B breaks 2,100 tokens/s (cerebras.ai)
Today we’re announcing the biggest update to Cerebras Inference since launch. Cerebras Inference now runs Llama 3.1-70B at an astounding 2,100 tokens per second – a 3x performance boost over the prior release.
AMD GPU Inference (github.com/slashml)
This project provides a Docker-based inference engine for running Large Language Models (LLMs) on AMD GPUs. It's designed to work with models from Hugging Face, with a focus on the LLaMA model family.
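A hedged sketch of what such a setup typically boils down to: loading a Hugging Face LLaMA-family checkpoint with Transformers on a ROCm build of PyTorch, where the AMD GPU is still addressed through the `cuda` device string. The model name and generation settings are assumptions for illustration, not the project's actual interface.

```python
# Illustrative only: generic Hugging Face inference on an AMD GPU (ROCm).
# ROCm builds of PyTorch expose the GPU via the "cuda" device string,
# so the standard Transformers code path applies unchanged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any LLaMA-family model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Inference on an AMD GPU:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```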
Cerebras Inference: AI at Instant Speed (cerebras.ai)
Cerebras launches inference for Llama 3.1; benchmarked at 1846 tokens/s on 8B (twitter.com)
Cerebras Launches the Fastest AI Inference (cerebras.ai)
Inference is free and instant (fume.substack.com)
Abstract representations emerge in human hippocampal neurons during inference (nature.com)
Nvidia NVLink and Nvidia NVSwitch Supercharge Large Language Model Inference (nvidia.com)
Groq Raises $640M to Meet Soaring Demand for Fast AI Inference (groq.com)
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI (bentoml.com)
AMD's MI300X Outperforms Nvidia's H100 for LLM Inference (tensorwave.com)
26× Faster Inference with Layer-Condensed KV Cache for Large Language Models (arxiv.org)
Practical Llama 3 inference implemented in a single Java file (github.com/mukel)
Effort – a possibly new algorithm for LLM Inference (kolinko.github.io)
Ampere Readies 256-Core CPU Beast, Awaits the AI Inference Wave (nextplatform.com)