Prefill-Decode Disaggregation: NVIDIA's 7x Inference Fix

Every LLM inference request is two workloads pretending to be one. Prefill processes your entire prompt in a single compute-bound forward pass, so it wants raw FP8 TFLOPS. Decode generates tokens one at a time, streaming model weights and KV cache tensors through memory at every step, so it wants HBM bandwidth. Running both on the same GPU means each phase blocks the other, and the SemiAnalysis InferenceX benchmarks that NVIDIA cites show that separating them delivers up to 7x throughput on Blackwell for DeepSeek R1 (NVIDIA Technical Blog). That gap is why disaggregated inference went from research paper to production default in under two years.

Prefill and Decode Want Opposite Hardware

The fundamental insight is mechanical. During prefill, the model computes attention over every token in your prompt in parallel. A 32K-token prompt turns into large dense matrix multiplies: the tensor cores run flat out and memory bandwidth is rarely the limit. During decode, the model generates one token per step, and each step must read the model weights plus the key-value vectors for every preceding token from HBM. With a 4K-token context and 256 output tokens, that is 256 sequential passes over a KV cache spanning thousands of entries. Compute is nearly idle; the GPU is starved for memory bandwidth (Spheron Engineering Blog).
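The gap can be put in rough numbers with a back-of-envelope arithmetic intensity estimate (FLOPs per byte moved). This is a minimal sketch under stated assumptions, not a profiler: it uses the common approximation of about 2 FLOPs per parameter per token, ignores the quadratic attention term, and assumes the weights are read once per forward pass. The 70B parameter count and 1-byte (FP8) weights are illustrative values, not figures from the sources cited here.

```python
def arithmetic_intensity(tokens_per_pass, n_params, bytes_per_param=1.0, kv_bytes_read=0.0):
    """Approximate FLOPs per byte for one forward pass.

    Assumptions (illustrative only):
      - ~2 FLOPs per parameter per token (dense matmuls, attention term ignored)
      - weights are streamed from HBM once per pass
      - kv_bytes_read is any extra KV cache traffic for that pass
    """
    flops = 2.0 * n_params * tokens_per_pass
    bytes_moved = n_params * bytes_per_param + kv_bytes_read
    return flops / bytes_moved


if __name__ == "__main__":
    n_params = 70e9  # hypothetical 70B dense model with FP8 weights
    print("prefill, 32768-token prompt:", arithmetic_intensity(32768, n_params))
    print("decode, one token per step: ", arithmetic_intensity(1, n_params))
```

Batching several decode requests together raises the decode figure, because the same weights are reused across the batch, but each request still brings its own KV cache reads.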

These are diametrically opposed bottlenecks. A GPU chosen for prefill (high TFLOPS, like the H100 SXM5 at 3,958 FP8 TFLOPS with sparsity, roughly half that dense, and 3.35 TB/s HBM3) underutilizes its tensor cores during decode, which runs at 5–15% compute utilization. A GPU chosen for decode on bandwidth per dollar, like the A100 80G at 2 TB/s HBM2e, gets little out of its memory subsystem during prefill, which is limited by compute rather than HBM. Monolithic serving pays both penalties simultaneously (Spheron Dynamo Guide).
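A simple roofline calculation shows why the same GPU cannot be busy on both axes at once. The sketch below assumes the textbook roofline model, where attainable throughput is the smaller of peak compute and intensity times bandwidth. The inputs are the 3,958 FP8 TFLOPS and 3.35 TB/s figures quoted above; the function names are invented for illustration.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """FLOPs per byte above which a kernel is compute-bound on this GPU."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


if __name__ == "__main__":
    peak = 3958e12  # FP8 FLOP/s as quoted in the text
    bw = 3.35e12    # HBM3 bytes/s as quoted in the text
    print("ridge point (FLOPs/byte):", ridge_point(peak, bw))
    # A decode-like kernel at ~2 FLOPs/byte sits far below the ridge point.
    print("fraction of peak at intensity 2:", attainable_flops(2.0, peak, bw) / peak)
```

Using a dense rather than sparse peak lowers the ridge point, but a decode step at a few FLOPs per byte stays far below it either way.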

Head-of-Line Blocking Is the Real Killer

Throughput math alone does not explain why teams re-architect their serving stack. The dominant production pain is latency variance caused by head-of-line blocking. In a monolithic vLLM or SGLang deployment, all requests share the same continuous batching loop. When a long-prompt request enters prefill, it monopolizes the GPU’s compute for a sustained window. Every decode batch queued behind it stalls — tokens stop flowing for existing users mid-generation. A single 32K-token request arriving mid-conversation can spike tail latency for hundreds of concurrent users (Spheron Engineering Blog).

This is not a theoretical concern. It is the reason interactivity targets, such as the ~50 tokens/second per user that makes an assistant feel responsive, are so hard to hit under mixed workloads. Short prompts and long prompts interleave unpredictably, and a prefill spike is bounded only by the longest prompt the deployment accepts. Continuous batching smooths some of this, and chunked prefill goes further by packing decode steps between prefill chunks, but neither can eliminate the fundamental resource contention.
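A toy model makes the stall visible. The sketch below assumes a monolithic loop in which an arriving prompt is prefilled in one blocking pass before the next decode step runs. The 20 ms decode step (which corresponds to 50 tokens/second) and the 60 ms per 1,000 prompt tokens are made-up illustrative timings, not measurements.

```python
def inter_token_gaps_ms(arrivals, decode_step_ms=20.0, prefill_ms_per_1k_tokens=60.0):
    """Gap seen by an in-flight decode stream at each loop iteration.

    arrivals[i] is the prompt length (in tokens) of a request that arrives
    just before iteration i, or 0 if nothing arrives. All timings are
    hypothetical; the point is the shape of the spike, not its size.
    """
    gaps = []
    for prompt_tokens in arrivals:
        # A blocking prefill delays the decode step that follows it.
        stall_ms = prefill_ms_per_1k_tokens * prompt_tokens / 1000.0
        gaps.append(stall_ms + decode_step_ms)
    return gaps


if __name__ == "__main__":
    # Steady decoding, then a single 32,000-token prompt arrives.
    print(inter_token_gaps_ms([0, 0, 32000, 0]))
```

Every user decoding in the same batch sees the same spike, which is why one long prompt shows up in tail latency for the whole replica.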

How Disaggregation Reshapes the Serving Topology

Disaggregated inference splits the serving cluster into two independent worker pools. Prefill nodes receive incoming requests, run the full forward pass over the prompt, and produce a KV cache block. Decode nodes receive that KV cache and run autoregressive generation to completion. A router sits in front, directing traffic and tracking where each request’s KV cache lives (NVIDIA Technical Blog).
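The topology can be sketched as a small piece of bookkeeping. This is an illustrative toy, not Dynamo's router: the class, method names, and placement policies (round-robin for prefill, least-loaded for decode) are all assumptions chosen for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Router:
    """Hypothetical router that tracks which worker holds each request's KV cache."""

    prefill_workers: list[str]
    decode_workers: list[str]
    kv_location: dict[str, str] = field(default_factory=dict)
    decode_load: dict[str, int] = field(default_factory=dict)

    def __post_init__(self):
        self._prefill_cycle = cycle(self.prefill_workers)
        for worker in self.decode_workers:
            self.decode_load.setdefault(worker, 0)

    def route_prefill(self, request_id: str) -> str:
        # Round-robin placement; the KV cache will be produced on this worker.
        worker = next(self._prefill_cycle)
        self.kv_location[request_id] = worker
        return worker

    def route_decode(self, request_id: str) -> tuple[str, str]:
        # Pick the least-loaded decode worker and record the KV cache move.
        source = self.kv_location[request_id]
        target = min(self.decode_workers, key=lambda w: self.decode_load[w])
        self.decode_load[target] += 1
        self.kv_location[request_id] = target
        return source, target  # the transport layer copies KV from source to target

    def finish(self, request_id: str) -> None:
        # Call only after route_decode: frees the decode slot for this request.
        worker = self.kv_location.pop(request_id)
        self.decode_load[worker] -= 1


if __name__ == "__main__":
    router = Router(["prefill-0", "prefill-1"], ["decode-0", "decode-1"])
    router.route_prefill("req-a")
    print(router.route_decode("req-a"))
```

A production router would also weigh prefix-cache hits and queue depth; the only point here is that the router owns the mapping from request to KV cache location.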

The critical dependency is moving the KV cache from prefill node to decode node fast enough that the transfer does not eat the gains. For a large model, the KV cache for a 32K-token context can be several gigabytes. If you ship it over a standard TCP socket, you add hundreds of milliseconds of latency and the disaggregation benefit evaporates. This is where the transport layer becomes the entire ballgame.
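The size of the problem follows from the model shape. The sketch below uses the standard per-token KV cache formula (one key and one value vector per layer per KV head) and a naive bandwidth-only transfer estimate. The 80-layer, 8-KV-head, 128-dimension shape and the 50 GB/s link are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence.

    The factor of 2 covers one key vector and one value vector per token,
    per layer, per KV head. bytes_per_value=2 assumes FP16/BF16 storage.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes, link_bytes_per_s, setup_latency_s=0.0):
    """Naive estimate: fixed setup latency plus size divided by link bandwidth."""
    return setup_latency_s + num_bytes / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model shape and link speed, for illustration only.
    size = kv_cache_bytes(tokens=32768, layers=80, kv_heads=8, head_dim=128)
    print("KV cache GiB:", size / 2**30)
    print("seconds at 50 GB/s:", transfer_seconds(size, 50e9))
```

Transfer time grows linearly with context length and shrinks with link bandwidth, so the same prompt that is cheap to move over a fast fabric can dominate time-to-first-token on a slow one.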

NIXL: The KV Cache Transport

NVIDIA’s Inference Transfer Library (NIXL) is the open-source, vendor-agnostic data movement library that turns disaggregated inference from a slide into a deployed system. NIXL moves KV cache tensors between GPUs using RDMA over InfiniBand, NVLink, GPUDirect Storage, or fallback TCP. It provides a non-blocking API with dynamic metadata exchange, so prefill and decode pools can scale elastically without pre-registered endpoints (NVIDIA Technical Blog).

NIXL is not Dynamo-specific. As of mid-2026 it has been integrated into vLLM, SGLang, TensorRT-LLM, and LMCache — meaning the KV cache transfer primitive is available across the major inference engines regardless of which orchestration layer you run on top. The library also supports tiering KV cache to NVMe storage and cloud object stores, which extends disaggregation into a three-tier memory hierarchy: GPU HBM for hot decode, CPU RAM for warm cache, NVMe for cold long-context prefixes (NVIDIA Technical Blog).
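The three-tier hierarchy behaves like a chain of LRU caches. The sketch below is a generic illustration of that idea and does not use or imitate the NIXL API: the class name, tier names, and block-count capacities are all hypothetical.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: blocks evicted from a hot tier demote to the next one."""

    TIERS = ("hbm", "cpu", "nvme")  # hottest to coldest

    def __init__(self, capacity):
        # capacity maps tier name -> maximum number of blocks (hypothetical unit)
        self.capacity = capacity
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def _insert(self, key, block, tier_index):
        if tier_index >= len(self.TIERS):
            return  # evicted from the coldest tier: the block is dropped
        name = self.TIERS[tier_index]
        tier = self.tiers[name]
        tier[key] = block
        tier.move_to_end(key)
        if len(tier) > self.capacity[name]:
            # Demote the least recently used block to the next tier down.
            old_key, old_block = tier.popitem(last=False)
            self._insert(old_key, old_block, tier_index + 1)

    def put(self, key, block):
        for tier in self.tiers.values():
            tier.pop(key, None)  # avoid stale copies in colder tiers
        self._insert(key, block, 0)

    def get(self, key):
        """Return (block, tier it was found in), promoting hits back to HBM."""
        for name in self.TIERS:
            if key in self.tiers[name]:
                block = self.tiers[name].pop(key)
                self._insert(key, block, 0)
                return block, name
        return None, None
```

A real system would size tiers in bytes and overlap promotion with compute; the sketch only shows the demote-on-evict, promote-on-hit pattern.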

Chunked Prefill vs Full Disaggregation

Full disaggregation is not the only tool, and it is not always the right one. Chunked prefill — supported natively in vLLM and SGLang — solves a similar problem with zero infrastructure overhead. Instead of processing a long prompt in one blocking pass, it breaks the prompt into fixed-size chunks and interleaves them with ongoing decode steps. A 32K-token prompt becomes four 8K-token chunks, with decode batches running between each. Decode is never fully blocked (Spheron Engineering Blog).
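The interleaving described above can be sketched as a schedule generator. This is a simplified illustration rather than the vLLM or SGLang scheduler: real engines typically fold prefill chunks and decode tokens into shared batches under a token budget, and the function name and tuple layout here are invented for the example.

```python
def chunked_prefill_schedule(prompt_tokens, chunk_size, decode_steps_between=1):
    """Return a list of (kind, start, end) steps.

    Prefill steps cover the token range [start, end); decode steps carry
    None for both bounds. Decode steps are placed between chunks only,
    mirroring the description in the text.
    """
    schedule = []
    for start in range(0, prompt_tokens, chunk_size):
        end = min(start + chunk_size, prompt_tokens)
        schedule.append(("prefill", start, end))
        if end < prompt_tokens:
            schedule.extend([("decode", None, None)] * decode_steps_between)
    return schedule


if __name__ == "__main__":
    for step in chunked_prefill_schedule(32768, 8192):
        print(step)
```

The chunk size is the tuning knob: smaller chunks shorten the longest decode stall but add more scheduling steps to finish the same prompt.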

Dimension	Chunked Prefill	Full Disaggregation
Infrastructure	Single node, no changes	Two node pools + router + NIXL
Head-of-line blocking	Reduced, not eliminated	Eliminated
Hardware matching	Same GPU for both phases	Match GPU tier to each phase
KV cache transfer	None (local)	RDMA/NVLink over network
Best for	Single-node, moderate load	Multi-node, high concurrency

The decision is a function of scale. If you are serving a 70B model on a single 8-GPU node with moderate concurrency, chunked prefill captures most of the benefit. Once you cross into multi-node deployments — anything requiring tensor parallelism across nodes, or serving thousands of concurrent users — the head-of-line blocking and hardware mismatch costs make full disaggregation worth the complexity.

Matching GPU Tiers to Each Phase

One of the most under-discussed wins of disaggregation is heterogeneous hardware. Once prefill and decode are separate services, you can match each to its actual bottleneck:

Prefill nodes: H100 SXM5 or B200 — maximize FP8 TFLOPS for dense attention compute. H200 for very long contexts (128K+) that need larger KV cache capacity in HBM.
Decode nodes: A100 80G or H100 PCIe — maximize HBM bandwidth per dollar. The A100 80G at 2 TB/s handles token generation well at roughly half the hourly cost of an H100 SXM5 (Spheron Dynamo Guide).

This tier-mixing strategy compounds. Decode pools are typically larger than prefill pools because generation takes longer than prompt processing. Running decode on cheaper, bandwidth-optimized GPUs can cut cluster cost by 30–40% without touching per-token latency — a saving that monolithic serving structurally cannot access because it is locked into whichever GPU handles both phases.
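The cost effect is simple arithmetic. The sketch below normalizes the prefill GPU price to 1.0 and takes the decode GPU at half that, following the "roughly half the hourly cost" figure above. The 2-prefill, 6-decode pool split is hypothetical, and the comparison assumes the cheaper decode GPUs sustain the same decode throughput, which has to be verified per workload.

```python
def hourly_cost(prefill_gpus, decode_gpus, prefill_price, decode_price):
    """Total hourly cluster cost for a two-pool deployment."""
    return prefill_gpus * prefill_price + decode_gpus * decode_price


def saving_fraction(baseline_cost, alternative_cost):
    """Relative saving of the alternative versus the baseline."""
    return (baseline_cost - alternative_cost) / baseline_cost


if __name__ == "__main__":
    # Prices normalized to the prefill GPU; pool sizes are illustrative.
    homogeneous = hourly_cost(2, 6, prefill_price=1.0, decode_price=1.0)
    mixed = hourly_cost(2, 6, prefill_price=1.0, decode_price=0.5)
    print("saving:", saving_fraction(homogeneous, mixed))
```

The saving grows with the share of the cluster devoted to decode, which is why the size imbalance between the two pools matters.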

Multi-Round Agent Workloads Break the Split

The disaggregation model assumes a clean prefill-then-decode sequence per request. Agentic workloads violate that assumption. A multi-turn coding agent or RAG pipeline sends a long initial prompt, gets a short response, then sends a slightly longer follow-up that reuses most of the prior context. Each turn has an incremental prefill — the new tokens — followed by decode. The prefill and decode phases interleave unpredictably across rounds, and the optimal split between prefill and decode GPUs shifts dynamically (arXiv:2602.14516, He et al.).

The AMPD framework from that same paper addresses this with adaptive prefill placement: the system decides in real time whether to run incremental prefill work on the decode node (avoiding a KV transfer) or ship it to a dedicated prefill node (preserving decode throughput). Empirical results show substantial SLO attainment improvements over static disaggregation baselines for multi-round workloads. The takeaway: if you are serving agents, a naive prefill-decode split will underperform a system that adapts the split per request.
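The trade-off behind that decision can be illustrated with a per-request cost comparison. The sketch below is not the AMPD algorithm. It is a hypothetical heuristic that keeps the incremental prefill on the decode node only when doing so is both within a stall budget and no slower than the remote path; every parameter name and value is an assumption.

```python
def place_incremental_prefill(
    new_tokens,
    kv_bytes_to_move,
    local_prefill_tokens_per_s,
    remote_prefill_tokens_per_s,
    link_bytes_per_s,
    decode_stall_budget_s,
):
    """Choose where to run a follow-up turn's prefill (illustrative heuristic).

    Local: prefill runs on the decode node, so no KV transfer is needed, but
    ongoing decode stalls for the duration. Remote: a dedicated prefill node
    does the work, at the price of moving KV cache across the link.
    """
    local_s = new_tokens / local_prefill_tokens_per_s
    remote_s = new_tokens / remote_prefill_tokens_per_s + kv_bytes_to_move / link_bytes_per_s
    if local_s <= decode_stall_budget_s and local_s <= remote_s:
        return "decode_node"
    return "prefill_node"


if __name__ == "__main__":
    # Hypothetical numbers: a short follow-up turn versus a long pasted document.
    common = dict(
        kv_bytes_to_move=1e9,
        local_prefill_tokens_per_s=10_000,
        remote_prefill_tokens_per_s=40_000,
        link_bytes_per_s=50e9,
        decode_stall_budget_s=0.05,
    )
    print(place_incremental_prefill(new_tokens=200, **common))
    print(place_incremental_prefill(new_tokens=8000, **common))
```

Short follow-up turns tend to stay local under this rule, while large increments are shipped out, which matches the intuition that the right split changes from round to round.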

Production Adoption Is Already Wide

This is not vaporware. NVIDIA lists ByteDance, CoreWeave, Together AI, Baseten, Tencent Cloud, Crusoe, Nebius, Pinterest, and over a dozen others as having deployed Dynamo in production as of March 2026. AWS, Google Cloud, Microsoft Azure, Alibaba Cloud, and Oracle have all built managed Kubernetes integrations for it. SGLang contributed its HiCache solution to Dynamo’s router, LMCache integrated storage-tiered KV caching, and LangChain built an integration that injects agentic routing hints (NVIDIA Technical Blog). The CNCF project llm-d offers a Kubernetes-native, vendor-neutral path to the same disaggregated architecture for teams that want to avoid NVIDIA’s control plane.

When You Should Not Disaggregate

Disaggregation adds a router, a KV cache transport dependency, and operational complexity around node pool scaling and failure handling. If your workload is single-node, low-concurrency, or dominated by short prompts (under 2K tokens), the head-of-line blocking problem is minimal and chunked prefill is sufficient. If your team lacks RDMA-capable networking — NIXL over TCP works but adds latency that erodes the gains — the economics flip. And if you are serving a model small enough that KV cache transfers are trivially fast, the overhead of the split may exceed the benefit.

The honest engineering answer: disaggregation wins when you are multi-node, high-concurrency, serving reasoning models with long contexts, or mixing heterogeneous GPU tiers. For everyone else, chunked prefill plus continuous batching is the simpler path to 80% of the gain. If you are also battling MoE memory overhead or K8s GPU capacity waste, disaggregation compounds with those fixes rather than competing with them.
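The rules of thumb in this section can be written down as a small decision function. This is a sketch of the heuristics stated above, not a sizing tool: the field names are invented, "2K tokens" is interpreted as 2,048, and real decisions should rest on measured latency and cost.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of a serving workload."""

    multi_node: bool
    high_concurrency: bool
    typical_prompt_tokens: int
    has_rdma_network: bool


def recommend(workload, short_prompt_threshold=2048):
    """Return "chunked_prefill" or "disaggregate" following the text's heuristics."""
    if not workload.multi_node or not workload.high_concurrency:
        return "chunked_prefill"  # head-of-line blocking is minimal at this scale
    if workload.typical_prompt_tokens < short_prompt_threshold:
        return "chunked_prefill"  # short prompts rarely stall decode for long
    if not workload.has_rdma_network:
        return "chunked_prefill"  # TCP transfer latency erodes the gains
    return "disaggregate"


if __name__ == "__main__":
    print(recommend(Workload(True, True, 16000, True)))
    print(recommend(Workload(False, False, 1000, False)))
```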
