Naive static batching leaves roughly 60% of an H100 GPU idle during LLM serving, because finished requests hold their slots until the slowest sequence in the batch completes. Continuous batching — iteration-level scheduling introduced in the Orca paper and now the default in vLLM, TensorRT-LLM and TGI — fixes this by reshuffling the batch after every decode step, lifting GPU utilization from 30–40% to 75–85% and delivering 3–5x more traffic per GPU.
Key points:
- Static batching suffers 60–80% padding overhead on variable-length LLM output, holding GPU slots idle until the longest request finishes.
- Continuous batching exploits the natural scheduling boundary between autoregressive decode steps, releasing finished requests and admitting queued ones each iteration.
- The 2022 Orca paper showed iteration-level scheduling reaching up to 36.9x throughput over prior serving systems — with no model changes.
- Combined with PagedAttention, vLLM improves throughput 2–4x at equal latency, and beats HuggingFace TGI by up to 24x under high concurrency.
- For latency-sensitive single-user workloads, the technique can hurt — TGI’s tighter batching wins on tail latency there.
Static Batching Wastes GPUs
The default way to run inference on a transformer looks efficient on paper and falls apart on real traffic. You collect N requests into a fixed batch, launch a forward pass, and the whole batch finishes together. The flaw is autoregressive generation: every request produces a different number of output tokens, and you cannot predict the length when you launch the batch.
If 16 requests enter a batch and the longest generates 512 tokens while the shortest stops at 64, the shortest sequence holds its GPU slot for 448 decode steps after it has finished, and every other sequence idles for the gap between its own length and 512. Those slot-steps execute no useful work. Measurements of real inference workloads show padding overhead of 60–80% for typical batch sizes and sequence-length distributions, documented systematically in the PagedAttention paper (Kwon et al., SOSP 2023). The practical symptom is unambiguous: deploy a naive PyTorch inference loop, run nvidia-smi dmon at moderate load, and watch streaming-multiprocessor (SM) utilization hover around 30–40% on hardware you pay for by the hour (Spheron, 2026).
| Dimension | Static batching | Continuous (iteration-level) batching |
|---|---|---|
| Scheduling unit | Entire batch, start-to-finish | Single decode iteration |
| Slot release | Only when slowest request done | Immediately on request completion |
| Typical GPU SM utilization | 30–40% | 75–85% |
| Padding / idle overhead | 60–80% | Near zero |
| Throughput vs naive loop | 1x (baseline) | 3–5x on H100 |
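The idle share of a static batch follows directly from the output lengths. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name and both example batches are hypothetical, and it counts only decode slot-steps (one slot per request per iteration), ignoring prefill and memory effects.

```python
def static_batch_idle_fraction(output_lengths):
    """Fraction of slot-steps left idle when a static batch runs until its
    longest sequence finishes."""
    longest = max(output_lengths)
    total_slot_steps = longest * len(output_lengths)   # slots held x iterations
    useful_slot_steps = sum(output_lengths)            # steps that emit a token
    return 1.0 - useful_slot_steps / total_slot_steps


# Hypothetical batches of 16 requests (illustrative, not real traffic).
skewed = [512] + [64] * 15          # one long answer, fifteen short ones
mixed = [64, 128, 256, 512] * 4     # lengths spread over a wider range

skewed_idle = static_batch_idle_fraction(skewed)
mixed_idle = static_batch_idle_fraction(mixed)
print(f"skewed batch idle share: {skewed_idle:.3f}")
print(f"mixed batch idle share:  {mixed_idle:.3f}")
```

Under these assumed distributions the skewed batch wastes roughly 82% of its slot-steps and the mixed batch roughly 53%, which shows how strongly the overhead depends on how uneven the output lengths are.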
Iteration-Level Scheduling
The fix comes from recognizing a scheduling boundary that static batching ignores. Between any two decoding iterations the model has completed one forward pass and is about to begin the next. At that boundary the scheduler has complete freedom to change the composition of the batch: drop requests that emitted an EOS token, free their KV cache, and admit waiting requests — all without interrupting any in-flight sequence (Brenndoerfer, 2026).
This is the contribution of Orca, the OSDI 2022 paper that first applied iteration-level scheduling to transformer serving. Orca pairs iteration-level scheduling with selective batching, and it reached up to 36.9x higher throughput than NVIDIA FasterTransformer, the existing system it was measured against, at the same latency. The technique requires no modification to the model weights, only to the serving infrastructure (Yu et al., OSDI 2022). That result is what made continuous batching the default rather than an optimization. In vLLM v0.18.0 you do not enable it; it is always on (Spheron, 2026).
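The scheduling boundary can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions, not Orca's or vLLM's scheduler: every running request emits exactly one token per iteration, prefill and KV cache limits are ignored, and the simulation knows each output length in advance (a real scheduler only learns it when EOS appears). The function names and the workload are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, num_slots):
    """Decode iterations when each batch runs until its longest member ends."""
    steps = 0
    for start in range(0, len(output_lengths), num_slots):
        steps += max(output_lengths[start:start + num_slots])
    return steps


def continuous_batching_steps(output_lengths, num_slots):
    """Decode iterations when the batch is recomposed after every iteration."""
    waiting = deque(output_lengths)   # tokens each queued request will emit
    running = []                      # remaining tokens per in-flight request
    steps = 0
    while waiting or running:
        # Iteration boundary: admit queued requests into any free slots.
        while waiting and len(running) < num_slots:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Requests that just emitted their last token release their slots.
        running = [remaining for remaining in running if remaining > 0]
    return steps


# Hypothetical workload: one long request followed by 31 short ones, 4 slots.
workload = [512] + [64] * 31
static_steps = static_batching_steps(workload, num_slots=4)
continuous_steps = continuous_batching_steps(workload, num_slots=4)
print(f"static: {static_steps} iterations, continuous: {continuous_steps} iterations")
```

In this toy run the continuous scheduler finishes the same workload in fewer iterations because the three slots next to the long request keep turning over short requests instead of waiting for it.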
PagedAttention Removes Memory Waste
Iteration-level scheduling creates a new problem the moment you apply it: requests now arrive and depart continuously, so the per-request KV cache — which grows and shrinks dynamically as tokens are generated — becomes a fragmentation nightmare under a conventional contiguous allocator. Pre-allocate worst-case and you waste memory; allocate on demand and you fragment the address space, capping how many concurrent sequences fit (Kwon et al., 2023).
PagedAttention imports the operating system’s virtual-memory trick into the attention kernel: KV cache is stored in fixed-size, non-contiguous blocks mapped through a page table, so a request never needs a contiguous region and blocks are recycled the instant a sequence ends. On top of this, vLLM reports near-zero waste in KV cache memory and a 2–4x throughput improvement at the same latency versus FasterTransformer and Orca, with the gap widening for longer sequences and larger models (Kwon et al., 2023). Continuous batching and PagedAttention are co-dependent: the scheduler can only admit a new request each iteration if it can actually place that request’s KV cache, which is only possible because paging eliminates fragmentation.
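The page-table idea can be sketched without any GPU code. The class below is a hypothetical toy allocator, not vLLM's block manager: it tracks block ids only, stores no tensors, and the block size of 16 tokens is an assumed value chosen for the illustration.

```python
class PagedKVCache:
    """Toy block-table allocator; tracks block ids only, no tensors."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens written so far

    def can_admit(self, prompt_len):
        """True if enough free blocks exist to hold a prompt of this length."""
        blocks_needed = -(-prompt_len // self.block_size)  # ceiling division
        return blocks_needed <= len(self.free_blocks)

    def append_token(self, request_id):
        """Reserve space for one more token; return False if out of blocks."""
        count = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count == len(table) * self.block_size:   # no room left in last block
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())    # any free block will do
        self.num_tokens[request_id] = count + 1
        return True

    def free(self, request_id):
        """Return every block of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens occupy 2 blocks
for _ in range(30):
    cache.append_token("b")   # 30 tokens occupy 2 blocks
cache.free("a")               # request "a" ends; its blocks are reusable at once
print(cache.block_tables, cache.free_blocks)
```

Because a request's blocks need not be adjacent, the only unused space is the tail of each request's last block, and `can_admit` is the kind of check a scheduler would run before pulling a request off the queue.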
Chunked Prefill Fixes Head-of-Line Blocking
Continuous batching on its own still has a blind spot: the prefill phase. Prefilling a long prompt is compute-bound and processes every prompt token in a single forward pass, so a 4K-token prompt arriving mid-stream stalls the in-flight decodes and every queued request behind it. This is classic head-of-line blocking, and it inflates time-to-first-token (TTFT). A 2025 empirical study of vLLM versus HuggingFace TGI confirms the asymmetry: vLLM’s iteration-level scheduler admits new requests immediately as slots free, but a single heavy prefill still stalls the pipeline (arXiv, 2025).
The production answer is chunked prefill (--enable-chunked-prefill in vLLM), which splits a long prefill across iterations so decode steps from other requests can interleave. On mixed workloads this cuts TTFT p95 by 50–70% (Spheron, 2026). This is also why disaggregated prefill-decode architectures — separating the two phases onto different hardware — have become a 2026 trend; see our deep dive on prefill-decode disaggregation for that path.
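How a long prefill gets spread across iterations can be sketched as a simple planning function. This is an illustrative model rather than vLLM's scheduler: it assumes a fixed per-iteration token budget, a constant number of decoding requests that each take one token of that budget, and a single pending prompt. The function name and the numbers are hypothetical.

```python
def plan_chunked_prefill(prompt_len, num_decoding, token_budget):
    """Split one long prefill across iterations under a per-step token budget.

    Each iteration first reserves one token for every decoding request, then
    gives the leftover budget to the next chunk of the pending prompt.
    Returns a list of (decode_tokens, prefill_tokens), one pair per iteration.
    """
    if token_budget <= num_decoding:
        raise ValueError("token_budget must exceed the number of decoding requests")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        plan.append((num_decoding, chunk))
        remaining -= chunk
    return plan


# Hypothetical case: a 4096-token prompt arrives while 32 requests are decoding
# and the scheduler may process at most 512 tokens per iteration.
plan = plan_chunked_prefill(prompt_len=4096, num_decoding=32, token_budget=512)
print(f"{len(plan)} iterations, last chunk = {plan[-1][1]} prompt tokens")
```

In this model the 32 decoding requests keep emitting a token on every iteration while the prompt is absorbed in pieces; the cost is that the new request's own prefill takes several iterations instead of one.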
Tuning vLLM For Real Throughput
The scheduler is on by default, but two flags largely decide whether you land nearer 3x or 5x. --max-num-seqs (default 1024 in vLLM V1) caps the number of concurrent sequences in the scheduler; raise it to 2048+ for high-traffic APIs with short outputs. --max-num-batched-tokens sets the total tokens processed per iteration across all sequences (the default varies with the deployment, typically 8192–32768); push it to 16384–32768 for throughput-optimized batch jobs (Spheron, 2026).
The trade-off is mechanical: more concurrent sequences means larger batched matmuls that saturate the GPU, but also more KV cache resident in HBM, which competes with model weights for the same memory budget. You are sizing to the knee of that curve. In aggregate, the stack of continuous batching + PagedAttention + chunked prefill is what lets vLLM serve 3–5x more traffic than a naive PyTorch loop on the same H100 (Spheron, 2026), and under high concurrency vLLM reaches up to 24x the throughput of HuggingFace TGI via PagedAttention (arXiv, 2025). Stacked with semantic caching and model routing, these scheduling gains are why disciplined teams report cutting managed-API spend 50–90% without touching model quality (Digital Applied, 2026).
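The memory side of that trade-off can be estimated with simple arithmetic. The sketch below is a rough sizing aid under stated assumptions, not a vLLM calculation: it assumes a 16-bit KV cache, a hypothetical grouped-query model with 32 layers, 8 KV heads and head dimension 128, 80 GiB of HBM, 16 GiB of weights, and 8 GiB held back for activations and overhead. All of these figures and both function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_resident_sequences(hbm_gib, weights_gib, reserve_gib, avg_seq_len,
                           bytes_per_token):
    """How many sequences of avg_seq_len tokens fit in the leftover HBM."""
    kv_budget_bytes = (hbm_gib - weights_gib - reserve_gib) * 1024**3
    return int(kv_budget_bytes // (avg_seq_len * bytes_per_token))


# Assumed model shape and memory split (hypothetical values, see lead-in).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
long_ctx = max_resident_sequences(80, 16, 8, avg_seq_len=2048,
                                  bytes_per_token=per_token)
short_ctx = max_resident_sequences(80, 16, 8, avg_seq_len=512,
                                   bytes_per_token=per_token)
print(f"{per_token} bytes/token, {long_ctx} seqs at 2048 tokens, "
      f"{short_ctx} seqs at 512 tokens")
```

With these assumed numbers the cache holds a few hundred 2048-token sequences, so a sequence cap in the thousands would only be reachable with much shorter contexts; the useful setting depends on the real model shape and traffic.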
Where Continuous Batching Hurts
Continuous batching optimizes aggregate throughput, not individual-request latency, and the two are not the same objective. When you admit more requests per iteration, each request shares the GPU with more neighbors, so per-token latency for any single user can rise even as total tokens-per-second climbs. The 2025 benchmarking study is explicit about this regime split: vLLM dominates high-throughput batch processing, but HuggingFace TGI shows lower tail latencies for interactive, single-user scenarios with moderate concurrency (arXiv, 2025).
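The split between the two objectives shows up even in a very crude cost model. The sketch below assumes each decode iteration costs a fixed amount (loading weights) plus a small amount per sequence in the batch; both constants are made up for illustration and are not measurements of any system.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.25):
    """Assumed time for one decode iteration: fixed cost plus per-sequence cost."""
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size):
    """Aggregate throughput: every sequence emits one token per iteration."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


for batch_size in (1, 8, 64, 256):
    # Per-token latency seen by one user equals the iteration time.
    print(f"batch={batch_size:>3}  "
          f"per-token latency={decode_step_ms(batch_size):6.2f} ms  "
          f"throughput={tokens_per_second(batch_size):8.1f} tok/s")
```

Under this model both columns rise with batch size: the fleet serves far more tokens per second, while each individual user waits longer for every token.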
Two more caveats matter in production. First, the gains are bounded by memory: if model weights already fill most of HBM, raising --max-num-seqs leaves too little room for KV cache, so the scheduler ends up evicting running sequences and throughput degrades instead of improving. Second, small models under low load see little benefit because there are not enough concurrent requests to fill slots in the first place. The technique earns its keep in the high-concurrency, variable-length regime that defines real LLM API traffic. For more on the cost side of that decision, see our H100 benchmarks comparing vLLM, TensorRT-LLM and SGLang in 2026.
Related reading:
- Prefill-decode disaggregation: NVIDIA’s 7x inference fix
- vLLM vs TensorRT-LLM vs SGLang: H100 benchmarks 2026
- How quantization halved our 70B LLM inference cost in 2026
References
- Yu, G.-I. et al. “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022 — USENIX proceedings
- Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 — arXiv:2309.06180
- Spheron. “LLM Serving Optimization: Continuous Batching, PagedAttention, and Chunked Prefill on H100 (2026).” — spheron.network
- Brenndoerfer, M. “Continuous Batching: Optimizing LLM Inference Throughput.” Language AI Handbook, 2026 — mbrenndoerfer.com
- “A Performance Study of vLLM and HuggingFace TGI.” arXiv, 2025 — arXiv:2511.17593
- Digital Applied. “AI Inference Cost Optimization: FinOps Playbook 2026.” — digitalapplied.com