Prefill Decode Disaggregation Doubles Your LLM Throughput

Prefill-decode disaggregation separates the two phases of LLM inference, prompt processing and token generation, onto dedicated GPU pools, eliminating the head-of-line blocking that causes latency spikes under concurrent load. Production deployments report 1.5x to 2.5x throughput gains, and cache-aware variants such as Together AI’s CPD add a further 35–40% in sustainable queries per second over standard disaggregated designs. Every major serving framework in 2026, including vLLM, SGLang, and NVIDIA Dynamo, now ships disaggregated inference as the default architecture for large-scale deployment.

A 32K-Token Prefill Stalls Everything

When a single 32,000-token prompt arrives in a collocated LLM serving system, every ongoing decode stream in that GPU’s batch stalls. Time-per-output-token (TPOT) can spike by 2 to 30 times while the GPU churns through the prefill forward pass. This is not a hypothetical edge case — it is the fundamental reason that nearly every production-grade LLM serving framework in 2026, from vLLM to NVIDIA Dynamo, now splits prefill and decode into separate compute pools.

The technique is called prefill-decode disaggregation. It was introduced as DistServe by Hao AI Lab at UCSD, received heavy pushback from the open-source community in 2024, and then became the default playbook across the industry within twelve months. The reason is simple economics: disaggregation eliminates head-of-line blocking between two workloads that have opposite hardware requirements, and the throughput gains are large enough to justify the added infrastructure complexity.

Prefill and Decode Pull Opposite Ways

LLM inference is not one operation. It is two phases with conflicting resource profiles:

  • Prefill processes the entire input prompt in a single forward pass. Every token attends to all the tokens before it through dense matrix multiplies. This phase is compute-bound: the bottleneck is raw FP8 TFLOPS, not memory bandwidth. It wants GPUs like the H100 or B200 with maximum FLOPS and fast NVLink for tensor parallelism.
  • Decode generates output tokens one at a time. Each new token must load key-value vectors for every previously processed token from HBM. This phase is memory-bandwidth-bound: the GPU spends most of each step loading KV cache tensors, not computing. It wants GPUs like the H200 with 4.8 TB/s HBM bandwidth and large capacity.
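The compute-bound versus memory-bound split can be checked with a back-of-envelope roofline estimate. The sketch below is a deliberate simplification: it counts only dense weight matmuls (2 FLOPs per parameter per token, with each weight read from HBM once per forward pass) and ignores attention and KV cache traffic. The function names are illustrative, and the 1e15 FLOP/s peak is a round placeholder rather than a quoted spec. The 3.35 TB/s bandwidth is the H100 figure used elsewhere in this article.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Each weight takes part in one multiply-add (2 FLOPs) per token, but is
    read from HBM once per pass no matter how many tokens share it.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """Intensity above which a kernel is limited by compute, not bandwidth."""
    return peak_flops / hbm_bytes_per_s


def classify(tokens_per_pass: int, peak_flops: float, hbm_bytes_per_s: float) -> str:
    intensity = arithmetic_intensity(tokens_per_pass)
    if intensity >= ridge_point(peak_flops, hbm_bytes_per_s):
        return "compute-bound"
    return "memory-bound"


ASSUMED_PEAK_FLOPS = 1.0e15     # placeholder, not a vendor number
H100_HBM_BYTES_PER_S = 3.35e12  # 3.35 TB/s

for label, tokens in [("prefill of a 32K-token prompt", 32_000),
                      ("decode step, batch of 1", 1)]:
    print(label, "->", classify(tokens, ASSUMED_PEAK_FLOPS, H100_HBM_BYTES_PER_S))
```

Under these assumptions the ridge point sits near 300 FLOPs per byte, so a long prefill lands far above it and a small decode batch far below it. Larger decode batches raise the intensity of the weight matmuls, but KV cache reads, which this sketch leaves out, grow with batch size and context length.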

When both phases share the same GPU, prefill jobs block decode batches from continuing their autoregressive generation. A single large prompt arriving mid-generation stalls every other user in the queue. The Hao AI Lab retrospective documents that chunked prefill mitigations help but do not eliminate the problem: a burst of large prefills still inflates TPOT unpredictably under real traffic.
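A toy model makes the interference concrete. The sketch below assumes a decode step is delayed by whatever prefill work is scheduled ahead of it: the whole prompt when collocated without chunking, one chunk with chunked prefill, and nothing when decode has its own GPUs. The timing constants are invented for illustration and are not measurements from any system.

```python
from typing import Optional


def worst_case_tpot_ms(prompt_tokens: int,
                       decode_step_ms: float,
                       prefill_ms_per_token: float,
                       chunk_tokens: Optional[int] = None,
                       disaggregated: bool = False) -> float:
    """Worst-case time between two output tokens of an ongoing stream."""
    if disaggregated:
        # Decode GPUs never run prefill, so a step costs only itself.
        return decode_step_ms
    if chunk_tokens is None:
        blocking_tokens = prompt_tokens                  # whole prompt blocks
    else:
        blocking_tokens = min(chunk_tokens, prompt_tokens)  # one chunk blocks
    return decode_step_ms + blocking_tokens * prefill_ms_per_token


# Hypothetical timings, chosen only to show the shape of the effect.
DECODE_STEP_MS = 25.0
PREFILL_MS_PER_TOKEN = 0.02

collocated = worst_case_tpot_ms(32_000, DECODE_STEP_MS, PREFILL_MS_PER_TOKEN)
chunked = worst_case_tpot_ms(32_000, DECODE_STEP_MS, PREFILL_MS_PER_TOKEN,
                             chunk_tokens=2_048)
split = worst_case_tpot_ms(32_000, DECODE_STEP_MS, PREFILL_MS_PER_TOKEN,
                           disaggregated=True)
print(f"collocated {collocated:.1f} ms, chunked {chunked:.1f} ms, "
      f"disaggregated {split:.1f} ms")
```

The model covers a single arriving prompt. A burst of several large prompts keeps a chunk in front of every decode step for longer, which is why chunking reduces the spike without removing it.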

How Disaggregation Actually Works

The architecture is straightforward in concept. You provision two separate pools of GPU nodes:

  1. Prefill nodes receive incoming requests, process the full prompt through the model, and produce a KV cache block for each request.
  2. The KV cache is transferred to decode nodes over the network using RDMA or TCP. In 2026, the standard mechanism is NIXL (NVIDIA Inference Xfer Library) for vLLM-based stacks, or AMD’s MORI-IO connector for MI300X deployments.
  3. Decode nodes receive the KV cache and run autoregressive generation until the response completes.
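The three steps above can be sketched as a toy pipeline. Everything here is a stand-in: the class names are hypothetical, the "model" is a sum modulo a made-up vocabulary size, and an in-process queue plays the role of the RDMA or TCP transfer. It shows the handoff of state between the pools, not any framework's API.

```python
from dataclasses import dataclass
from queue import Queue
from typing import List

VOCAB_SIZE = 1000  # toy vocabulary, purely illustrative


@dataclass
class KVCacheBlock:
    request_id: str
    context: List[int]  # stand-in for per-layer key/value tensors


class PrefillNode:
    def prefill(self, request_id: str, prompt: List[int]) -> KVCacheBlock:
        # A real node runs one forward pass over the whole prompt; here we
        # only record the context that the decode side will need.
        return KVCacheBlock(request_id, list(prompt))


class DecodeNode:
    def decode(self, kv: KVCacheBlock, max_new_tokens: int) -> List[int]:
        output = []
        for _ in range(max_new_tokens):
            next_token = sum(kv.context) % VOCAB_SIZE  # toy "model"
            kv.context.append(next_token)              # cache grows by one token
            output.append(next_token)
        return output


def serve(request_id: str, prompt: List[int], max_new_tokens: int) -> List[int]:
    transfer: Queue = Queue()  # stand-in for the network KV transfer
    transfer.put(PrefillNode().prefill(request_id, prompt))  # steps 1 and 2
    kv = transfer.get()
    return DecodeNode().decode(kv, max_new_tokens)           # step 3


print(serve("req-1", [1, 2, 3], max_new_tokens=3))
```

The design point the sketch preserves is that the decode node never sees the prompt as text to reprocess. It receives finished state and only appends to it.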

vLLM v0.8+ supports disaggregated prefill natively via the NixlConnector, where the prefill instance serves as a KV cache producer and the decode instance acts as a consumer. If you are still deciding between inference engines, our vLLM vs SGLang comparison covers how each handles disaggregation differently. NVIDIA Dynamo 1.0 implements the same split at scale with its own routing layer on top of vLLM.

AMD’s MORI-IO connector supports two transfer modes. In write mode (the default), the proxy dispatches to prefill and decode simultaneously. As prefill computes each layer, it pushes KV data directly into decode’s memory via RDMA, so decode can begin generating the moment prefill finishes. In read mode, the proxy waits for prefill to complete, then forwards KV block IDs to decode, which pulls the data itself. Write mode delivers lower TTFT; read mode offers simpler orchestration.
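The TTFT difference between the two modes comes from overlapping transfer with compute. The timeline model below is a simplified sketch, not MORI-IO code: it assumes a fixed compute time and a fixed transfer time per layer, one transfer in flight at a time, and no proxy overhead. The function names and the example numbers are hypothetical.

```python
def kv_ready_read_mode(layers: int, compute_ms: float, transfer_ms: float) -> float:
    """Decode pulls the KV cache only after prefill has finished every layer."""
    return layers * compute_ms + layers * transfer_ms


def kv_ready_write_mode(layers: int, compute_ms: float, transfer_ms: float) -> float:
    """Prefill pushes each layer's KV data as soon as that layer is computed."""
    transfer_done = 0.0
    for layer in range(1, layers + 1):
        compute_done = layer * compute_ms
        # A layer's transfer starts once the layer is computed and the
        # previous layer's transfer has left the link.
        transfer_done = max(compute_done, transfer_done) + transfer_ms
    return transfer_done


# Hypothetical per-layer timings in milliseconds.
LAYERS, COMPUTE_MS, TRANSFER_MS = 4, 10.0, 2.0
print("read mode :", kv_ready_read_mode(LAYERS, COMPUTE_MS, TRANSFER_MS))
print("write mode:", kv_ready_write_mode(LAYERS, COMPUTE_MS, TRANSFER_MS))
```

When each transfer is shorter than the layer's compute, write mode hides all but the last transfer behind prefill. When transfers dominate, the two modes converge, because the link rather than the GPU sets the pace.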

Cache-Aware Routing Splits Cold From Warm

Together AI pushed the architecture further with cache-aware prefill-decode disaggregation (CPD), published in March 2026. The core insight: not all prefills are equal. In real workloads, many requests contain large portions of context that have been seen before — shared system prompts, conversation history, common documents. These are warm requests. Others introduce mostly new context requiring full computation — these are cold requests.

In standard disaggregation, cold and warm prefills share the same prefill capacity. A 100K-token cold prompt occupies prefill resources for seconds, and warm requests that could have been served through cache reuse sit waiting in the same queue. CPD solves this by splitting inference into three roles:

| Role | Workload | Behavior |
|---|---|---|
| Pre-Prefill nodes | Cold (low-reuse) prompts | Compute new context, write KV cache to distributed store |
| Prefill nodes | Warm (high-reuse) requests | Read KV blocks from cache instead of recomputing |
| Decode nodes | All generation | Standard autoregressive decoding |

The result: CPD improves sustainable queries-per-second by 35–40% over existing disaggregated designs, while maintaining tighter tail latency bounds even when large cold prompts hit the system. The key design principle is simple: do not let expensive cold prefills block the fast path for reusable context. This connects directly to the broader problem of GPU scheduler waste from redundant KV cache regeneration, where up to 38% of compute time is lost regenerating context that could be cached and reused.
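One way to picture the cold/warm split is a router that estimates how much of a prompt's prefix is already cached and picks a pool accordingly. The sketch below is an assumption about how such a router could work, not Together AI's implementation. The block size, the 0.5 threshold, and all function names are hypothetical. Chained block hashes are a common way to identify a cached prefix.

```python
import hashlib
from typing import List, Set

BLOCK_TOKENS = 16     # hypothetical KV block size
WARM_THRESHOLD = 0.5  # hypothetical reuse fraction that counts as "warm"


def block_hashes(tokens: List[int]) -> List[str]:
    """One chained hash per full block, so each hash identifies a whole prefix."""
    hashes, prev = [], ""
    for start in range(0, len(tokens) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
        block = tokens[start:start + BLOCK_TOKENS]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def reuse_fraction(tokens: List[int], cached: Set[str]) -> float:
    """Share of the prompt covered by the longest cached prefix."""
    if not tokens:
        return 0.0
    hit_blocks = 0
    for h in block_hashes(tokens):
        if h not in cached:
            break  # prefix reuse stops at the first miss
        hit_blocks += 1
    return hit_blocks * BLOCK_TOKENS / len(tokens)


def route(tokens: List[int], cached: Set[str]) -> str:
    """Send warm requests to prefill nodes and cold ones to pre-prefill nodes."""
    if reuse_fraction(tokens, cached) >= WARM_THRESHOLD:
        return "prefill"
    return "pre-prefill"


system_prompt = list(range(64))
cache = set(block_hashes(system_prompt))  # blocks already in the KV store
print(route(system_prompt + [7] * 16, cache))  # shares the cached prefix
print(route([9] * 80, cache))                  # entirely new context
```

Once a pre-prefill node has written a cold prompt's blocks to the store, adding their hashes to `cache` makes the next request with the same prefix route to the warm path.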

Single-Node Disaggregation Changes the Math

The most persistent misconception about PD disaggregation is that it requires a multi-node datacenter cluster. AMD and the vLLM team demonstrated in April 2026 that disaggregation can run entirely within a single 8-GPU MI300X node, delivering 2.5x higher goodput compared to standard collocated serving on the same hardware.

The split is simple: four GPUs handle prefill while the other four handle decode. The KV cache transfer between the two instances stays on-node via MORI-IO’s RDMA layer, avoiding network round-trips entirely. The benchmark used Qwen3-235B-A22B-FP8 at 8 requests per second with 2,000-token prompts and 1,000-token outputs.
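The benchmark settings translate directly into the token rates each pool has to sustain. The arithmetic below uses only the numbers quoted above. The per-GPU figures are a normalization that assumes load spreads evenly, which glosses over how tensor-parallel GPUs cooperate on the same tokens.

```python
def offered_load(requests_per_s: float, prompt_tokens: int, output_tokens: int,
                 prefill_gpus: int, decode_gpus: int) -> dict:
    """Token rates implied by a steady request stream and a prefill/decode split."""
    prefill_tok_s = requests_per_s * prompt_tokens
    decode_tok_s = requests_per_s * output_tokens
    return {
        "prefill_tokens_per_s": prefill_tok_s,
        "decode_tokens_per_s": decode_tok_s,
        "prefill_tokens_per_gpu_s": prefill_tok_s / prefill_gpus,
        "decode_tokens_per_gpu_s": decode_tok_s / decode_gpus,
    }


load = offered_load(requests_per_s=8, prompt_tokens=2_000, output_tokens=1_000,
                    prefill_gpus=4, decode_gpus=4)
for name, value in load.items():
    print(f"{name}: {value:,.0f}")
```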

This matters because it means disaggregation is no longer a datacenter-only technique. Any team with a single 8-GPU node running vLLM can benefit. The goodput improvement comes from eliminating inter-token-latency (ITL) spikes that plague collocated serving under concurrency — dedicated decode GPUs ensure stable, predictable token generation regardless of prefill load.

Hardware Planning: Match GPU to Phase

Because prefill and decode have different bottlenecks, you can optimize cost by assigning different GPU types to each role. Spheron’s 2026 deployment guide provides a clear mapping, reproduced below. The math on HBM consumption per token, which shows how KV cache memory constrains decode performance, explains why decode wants bandwidth-heavy GPUs.

| GPU | Best Role | HBM Bandwidth | Key Advantage |
|---|---|---|---|
| H100 SXM5 | Prefill | 3.35 TB/s | High FP8 TFLOPS, lower cost |
| B200 SXM6 | Prefill | 7.7 TB/s | FP4 tensor cores, Blackwell |
| H200 SXM5 | Decode | 4.8 TB/s | 141 GB HBM3e, bandwidth-optimized |
| A100 80GB | Decode (small models) | 2.0 TB/s | Cost-effective for small KV cache |
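The link between HBM bandwidth and decode speed follows from the size of the KV cache. The sketch below uses the standard per-token formula (keys and values, for every layer and KV head) with an assumed model shape of 80 layers, 8 KV heads, and a head dimension of 128 at 2 bytes per element. That shape is an illustrative grouped-query-attention configuration, not a model discussed in this article. The bandwidth values come from the table above.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache footprint of one token: a key and a value vector per layer and head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_read_ms(context_tokens: int, bytes_per_token: int,
               hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step for one sequence: read its whole KV cache once."""
    return context_tokens * bytes_per_token / hbm_bytes_per_s * 1e3


# Assumed model shape, for illustration only.
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
CONTEXT = 32_000

print(f"KV cache per token: {PER_TOKEN / 1024:.0f} KiB")
print(f"KV cache at 32K context: {CONTEXT * PER_TOKEN / 1e9:.2f} GB")
for gpu, bandwidth in [("H200", 4.8e12), ("H100", 3.35e12), ("A100 80GB", 2.0e12)]:
    print(f"{gpu}: at least {kv_read_ms(CONTEXT, PER_TOKEN, bandwidth):.2f} ms "
          f"per decode step per sequence")
```

The bound ignores weight reads and compute, so real steps are slower. It still shows why per-token latency on long contexts scales with bandwidth rather than FLOPS.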

vLLM’s February 2026 benchmarks on NVIDIA GB200 demonstrate what happens when you pair disaggregation with hardware purpose-built for it: 26,200 prefill tokens per GPU-second and 10,100 decode tokens per GPU-second on DeepSeek R1/V3 workloads, using a deployment of 4 prefill instances (2 GB200s each) and 1 decode instance (8 GB200s). The GB200’s 8 TB/s memory bandwidth, FP4 compute throughput, and NVLink-C2C interconnect all contribute to gains over H200 deployments.
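Per-GPU rates are easier to compare once they are scaled to the whole deployment. The arithmetic below assumes that "tokens per GPU-second" is a per-GPU rate that can be multiplied by the GPU count of each pool. The source does not state aggregate figures, so treat the totals as derived estimates.

```python
def pool_throughput(instances: int, gpus_per_instance: int,
                    tokens_per_gpu_s: float) -> float:
    """Aggregate token rate of a pool, assuming the per-GPU rate scales linearly."""
    return instances * gpus_per_instance * tokens_per_gpu_s


prefill_total = pool_throughput(instances=4, gpus_per_instance=2,
                                tokens_per_gpu_s=26_200)
decode_total = pool_throughput(instances=1, gpus_per_instance=8,
                               tokens_per_gpu_s=10_100)

print(f"prefill pool: {prefill_total:,.0f} tokens/s on 8 GPUs")
print(f"decode pool:  {decode_total:,.0f} tokens/s on 8 GPUs")
print(f"prefill-to-decode ratio: {prefill_total / decode_total:.2f}")
```

The ratio indicates the workload shape this split suits: roughly 2.6 prompt tokens processed for every output token generated. Workloads with a different input-to-output mix would call for a different division of GPUs.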

When You Should NOT Disaggregate

Full disaggregation is not always the right answer. For shorter prompts and lower concurrency, chunked prefill solves the interference problem with far less infrastructure overhead. Chunked prefill breaks a long prompt into fixed-size chunks and interleaves them with ongoing decode steps, so decode is never fully blocked.
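The interleaving can be sketched as a schedule generator. This is a simplified illustration in which prefill chunks and decode steps strictly alternate. Production schedulers typically pack a prefill chunk and pending decode tokens into the same batch under a token budget, which this sketch does not model. The function name is hypothetical.

```python
from typing import Iterator, Tuple


def chunked_schedule(prompt_tokens: int,
                     chunk_tokens: int) -> Iterator[Tuple[str, int]]:
    """Yield ("prefill", n) chunks, each followed by one ("decode", 1) step."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    done = 0
    while done < prompt_tokens:
        n = min(chunk_tokens, prompt_tokens - done)
        yield ("prefill", n)
        yield ("decode", 1)  # ongoing streams advance after every chunk
        done += n


for phase, tokens in chunked_schedule(prompt_tokens=5_000, chunk_tokens=2_048):
    print(phase, tokens)
```

The chunk size is the tuning knob. Smaller chunks shorten the wait between decode steps but stretch the new request's time to first token, since its prefill now shares the GPU with every decode step in between.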

| Approach | When to Use | Overhead | Throughput Gain |
|---|---|---|---|
| Chunked prefill | Prompts under 8K tokens, single-node | Low (no network) | 20–40% |
| Full disaggregation | Prompts 8K+, high concurrency | Medium (KV transfer) | 1.5–2.5x |

The practical recommendation from Spheron’s engineering team: start with chunked prefill. If you are consistently hitting 8K+ prompt lengths with high concurrency and chunked prefill is not moving the needle enough, then add the infrastructure for full disaggregation. The vLLM team echoes this: disaggregation’s overhead includes RDMA transfer wait time and proxy serialization, which only pay off when the interference savings exceed the transfer cost.
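The recommendation reduces to a small decision rule. The helper below is a hypothetical encoding of it, not a tool from Spheron or vLLM. It reads "8K" as 8,000 tokens and expects you to supply your own measurements for the interference time saved and the transfer overhead added per request.

```python
LONG_PROMPT_TOKENS = 8_000  # "8K+" threshold, read here as 8,000 tokens


def recommend(typical_prompt_tokens: int,
              high_concurrency: bool,
              interference_saved_ms: float,
              transfer_overhead_ms: float) -> str:
    """Pick a serving strategy from workload shape and measured costs.

    transfer_overhead_ms should include KV transfer wait and proxy time.
    """
    long_prompts = typical_prompt_tokens >= LONG_PROMPT_TOKENS
    pays_off = interference_saved_ms > transfer_overhead_ms
    if long_prompts and high_concurrency and pays_off:
        return "full disaggregation"
    return "chunked prefill"  # the default starting point


# Hypothetical workloads.
print(recommend(2_000, high_concurrency=True,
                interference_saved_ms=5.0, transfer_overhead_ms=15.0))
print(recommend(32_000, high_concurrency=True,
                interference_saved_ms=400.0, transfer_overhead_ms=30.0))
```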

References