LLM inference Archives

Attention-FFN Disaggregation: MoE Inference’s Next Step

July 18, 2026

Attention-FFN disaggregation (AFD) is the next architectural split beyond prefill-decode disaggregation for large mixture-of-experts (MoE) models: it separates the attention modules from the feed-forward expert modules onto distinct GPU pools inside the decode phase, fusing the extra activation transfer with the all-to-all dispatch that MoE expert parallelism already requires. The …

Editorial team

Artificial Intelligence

Tuning vLLM KV Cache and Preemption in Production Serving

July 18, 2026

vLLM’s PagedAttention and iteration-level continuous batching are the two mechanisms that take an autoregressive transformer from the 20–40% GPU utilization typical under static batching toward the 2–4× throughput the SOSP 2023 paper reports. Production value does not come from enabling them — both are on by default — but from …

Editorial team

Artificial Intelligence

Model Distillation: 32B Beats o1-Mini at Half the Cost

July 9, 2026

A 32-billion-parameter student model fine-tuned on DeepSeek-R1’s reasoning traces scored 72.6% Pass@1 on AIME 2024, beating OpenAI’s o1-mini (63.6%) while costing roughly an order of magnitude less per token to serve. Released with the DeepSeek-R1 checkpoints, it is the strongest production evidence yet that knowledge distillation—not larger GPUs—is the dominant …

Editorial team

Artificial Intelligence

LLM Inference Nondeterminism: Why Temperature 0 Fails You

July 2, 2026

LLM inference nondeterminism means identical prompts can return different outputs even at temperature 0, because dynamic batching changes the order of floating-point reductions inside GPU kernels; the kernels lack a property called batch invariance. Thinking Machines Lab measured 80 distinct completions from 1,000 identical requests on Qwen3-235B, and researchers documented up to …

Editorial team

Artificial Intelligence

CUDA Graphs + torch.compile: 1.65x LLM Decode Speedup

June 30, 2026

A single decode step for Llama 3.1 8B on an H100 SXM5 takes 8.4 milliseconds in eager mode. Capture that same forward pass as a CUDA graph and it drops to 5.1 milliseconds — a 1.65× speedup. The reason is not faster math; it is the elimination of CPU-side kernel-launch …

Editorial team

Artificial Intelligence

NVIDIA NIM Economics: Where Self-Host Beats Every API

June 29, 2026

A single NVIDIA H100 GPU running a self-hosted NIM container costs roughly $1,950 per month on RunPod at $2.69 per hour, yet serves the same OpenAI-compatible /v1/chat/completions endpoint as GPT-4.1 — which bills $6 per million blended tokens. The crossover where NIM beats every per-token API sits around 300–500 million …

Editorial team

Artificial Intelligence

Continuous Batching: Why 60% of Your GPU Sits Idle

June 26, 2026

Naive static batching leaves roughly 60% of an H100 GPU idle during LLM serving, because finished requests hold their slots until the slowest sequence in the batch completes. Continuous batching — iteration-level scheduling introduced in the Orca paper and now the default in vLLM, TensorRT-LLM and TGI — fixes this …

Editorial team

Artificial Intelligence

Prefill-Decode Disaggregation: NVIDIA’s 7x Inference Fix

June 22, 2026

Every LLM inference request is two workloads pretending to be one. Prefill processes your entire prompt in a single compute-bound forward pass — it wants raw FP8 TFLOPS. Decode generates tokens one at a time by streaming KV cache tensors through memory — it wants HBM bandwidth. Running both on …

Editorial team

Artificial Intelligence

Prefill Decode Disaggregation Doubles Your LLM Throughput

June 14, 2026

Prefill-decode disaggregation separates the two phases of LLM inference (prompt processing and token generation) onto dedicated GPU pools, eliminating the head-of-line blocking that causes latency spikes under concurrent load. Production deployments report 1.5x to 2.5x throughput gains, with cache-aware variants like Together AI’s CPD reporting improvements of up to 40%. …

Editorial team

Artificial Intelligence

vLLM vs SGLang: Which Engine Actually Wins in 2026?

June 13, 2026

On H100 SXM5 80GB running Llama 3.3 70B Instruct at FP8, SGLang serves 1,920 tokens per second at 50-way concurrency — just 3.8% faster than vLLM’s 1,850. But swap to Llama 3.1 8B, and that gap explodes to 29%: SGLang hits 16,200 tok/s versus vLLM’s 12,500. The inference engine you …

Editorial team

Artificial Intelligence

Serverless GPU Cold Starts Take 40s – Here’s How to Fix

June 10, 2026

The 1000x Latency Gap: A cold-start instance on a serverless GPU platform produces its first token after more than 40 seconds. A warm instance generates subsequent tokens in roughly 30 milliseconds. That is a latency ratio of over 1,300:1 between the cold and warm states, and it is the single …

Editorial team

Artificial Intelligence

Google TPU v8 Puts KV Cache on Silicon to Cut Inference Cost

May 31, 2026

Google Put KV Cache on Silicon: Google’s TPU 8i triples on-chip SRAM to 384 MB and crams 288 GB of HBM onto a single chip, enough to host massive KV caches entirely in silicon, bypassing the memory wall that has bottlenecked LLM inference since the transformer era began. The …

Editorial team