Prefill-decode disaggregation separates the two phases of LLM inference — prompt processing and token generation — onto dedicated GPU pools, eliminating the head-of-line blocking that causes latency spikes under concurrent load. Production deployments report 1.5x to 2.5x throughput gains, with cache-aware variants like Together AI’s CPD pushing improvements to 40%. …
GPU Schedulers Waste 38% Time on Agent Cache Regeneration
Agent Cache Rebuilds Waste 38% GPU. When researchers at the University of Hong Kong instrumented a 32-GPU A100 cluster running SWE-bench coding agents on vLLM v0.6.0, they found a number that should bother every platform engineer: 38% of total execution time was spent regenerating KV cache that had been discarded …
Agentic AI Workflows Cost 5x More Than You Budgeted
One Agent Call Becomes Fifteen. Google’s TPU 8i pairs 288 GB of HBM with a dedicated Collectives Acceleration Engine specifically because a single agentic request now triggers an average of 6-12 downstream model calls. The infrastructure bill for “let me ask the AI” has quietly multiplied, and most teams haven’t …
Google TPU v8 Puts KV Cache on Silicon to Cut Inference Cost
Google Put KV Cache on Silicon. Google’s TPU 8i triples on-chip SRAM to 384 MB and crams 288 GB of HBM onto a single chip, enough to host massive KV caches entirely in silicon, bypassing the memory wall that has bottlenecked LLM inference since the transformer era began. The …