KV cache Archives

Tuning vLLM KV Cache and Preemption in Production Serving

July 18, 2026

vLLM’s PagedAttention and iteration-level continuous batching are the two mechanisms that take an autoregressive transformer from the 20–40% GPU utilization typical under static batching toward the 2–4× throughput the SOSP 2023 paper reports. Production value does not come from enabling them — both are on by default — but from …

Editorial team

Artificial Intelligence

Prefill Decode Disaggregation Doubles Your LLM Throughput

June 14, 2026

Prefill-decode disaggregation separates the two phases of LLM inference — prompt processing and token generation — onto dedicated GPU pools, eliminating the head-of-line blocking that causes latency spikes under concurrent load. Production deployments report 1.5x to 2.5x throughput gains, with cache-aware variants like Together AI’s CPD pushing improvements to 40%. …

Editorial team

Artificial Intelligence

GPU Schedulers Waste 38% Time on Agent Cache Regeneration

June 11, 2026

Agent Cache Rebuilds Waste 38% GPU: When researchers at the University of Hong Kong instrumented a 32-GPU A100 cluster running SWE-bench coding agents on vLLM v0.6.0, they found a number that should bother every platform engineer: 38% of total execution time was spent regenerating KV cache that had been discarded …

Editorial team

Artificial Intelligence

Agentic AI Workflows Cost 5x More Than You Budgeted

June 6, 2026

One Agent Call Becomes Fifteen: Google’s TPU 8i dedicates 288 GB of HBM and a dedicated Collectives Acceleration Engine specifically because a single agentic request now triggers an average of 6-12 downstream model calls. The infrastructure bill for “let me ask the AI” has quietly multiplied, and most teams haven’t …

Editorial team

Artificial Intelligence

Google TPU v8 Puts KV Cache on Silicon to Cut Inference Cost

May 31, 2026

Google Put KV Cache on Silicon: Google’s TPU 8i triples on-chip SRAM to 384 MB and crams 288 GB of HBM onto a single chip, enough to host massive KV caches entirely in silicon, bypassing the memory wall that has bottlenecked LLM inference since the transformer era began. The …

Editorial team