On H100 SXM5 80GB running Llama 3.3 70B Instruct at FP8, SGLang serves 1,920 tokens per second at 50-way concurrency — just 3.8% faster than vLLM’s 1,850. But swap to Llama 3.1 8B, and that gap explodes to 29%: SGLang hits 16,200 tok/s versus vLLM’s 12,500. The inference engine you pick isn’t a vanity decision. It determines whether your GPU fleet runs at 60% or 95% utilization, and whether your multi-turn agent pipeline recomputes KV cache on every request or reuses 80% of it.
TGI Is Dead. Pick a Side.
Hugging Face put Text Generation Inference (TGI) into maintenance mode in December 2025. No new features — only bug fixes. HF’s own Inference Endpoints now default to vLLM, with SGLang as the alternative. That leaves two open-source engines that speak the OpenAI API, run on NVIDIA (and AMD), and have production-grade continuous batching. If you’re standing up a self-hosted inference stack in 2026, the TGI migration question is settled. The real question is which of these two actually fits your workload shape — and the answer depends on something most teams never measure: your prefix overlap ratio.
Throughput Benchmarks: Smaller Than the Hype
Spheron ran the most cited head-to-head in March 2026: vLLM v0.18.0, SGLang v0.5.9, and TensorRT-LLM v1.2.0, all on the same H100 80GB with Llama 3.3 70B Instruct at FP8 precision. The setup used 200 prompts averaging 512 input and 256 output tokens at concurrency levels of 1, 10, 50, and 100.
| Concurrency | vLLM (tok/s) | SGLang (tok/s) | TTFT p50 vLLM | TTFT p50 SGLang |
|---|---|---|---|---|
| 1 | 120 | 125 | 45 ms | 42 ms |
| 10 | 650 | 680 | 120 ms | 112 ms |
| 50 | 1,850 | 1,920 | 380 ms | 360 ms |
| 100 | 2,400 | 2,460 | 740 ms | 710 ms |
Source: Spheron H100 Benchmarks (2026)
At 70B scale, the delta is 2.5-5% across the tested concurrency levels: a little over 4% at concurrency 1 and 10, 3.8% at 50, and 2.5% at 100. That’s meaningful at fleet scale (a 4% throughput gain across 50 H100s is real money), but it won’t change your architecture. The 8B numbers tell a different story. PremAI measured SGLang at roughly 16,200 tok/s versus vLLM’s 12,500, a gap large enough to serve the same load on roughly 23% fewer GPUs for high-volume classification or extraction workloads (TECHSY benchmarks, updated May 2026).
The pattern makes architectural sense: SGLang’s RadixAttention pays off more when prefill is a larger fraction of total compute, which happens with smaller models, shorter outputs, and shared system prompts.
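The percentages in this section can be re-derived from the quoted throughput figures. The sketch below does only that arithmetic; the helper names are illustrative, and the GPU-savings figure assumes throughput scales linearly with GPU count, which real fleets only approximate.

```python
# Throughput pairs (vLLM tok/s, SGLang tok/s) copied from the figures quoted above.
BENCH = {
    "70B @ concurrency 1": (120, 125),
    "70B @ concurrency 10": (650, 680),
    "70B @ concurrency 50": (1850, 1920),
    "70B @ concurrency 100": (2400, 2460),
    "8B": (12500, 16200),
}


def speedup_pct(vllm_tps, sglang_tps):
    """SGLang throughput advantage over vLLM, in percent."""
    return (sglang_tps / vllm_tps - 1.0) * 100.0


def gpu_savings_pct(vllm_tps, sglang_tps):
    """Percent fewer GPUs for the same aggregate load (assumes linear scaling)."""
    return (1.0 - vllm_tps / sglang_tps) * 100.0


for name, (v, s) in BENCH.items():
    print(f"{name}: +{speedup_pct(v, s):.1f}% throughput, "
          f"{gpu_savings_pct(v, s):.1f}% fewer GPUs")
```

A throughput advantage of about 30% translates into roughly 23% fewer GPUs, not half, because the saving is `1 - vllm/sglang` rather than the speedup itself.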
RadixAttention: The Real Game-Changer
This is where the two engines diverge architecturally, not just numerically. vLLM’s PagedAttention manages KV cache at the block level using hash-based lookup. SGLang’s RadixAttention, introduced in the original SGLang paper by Zheng et al., stores cached KV entries in a radix tree keyed by token sequence. When a new request shares a prefix with an existing cached entry, SGLang starts computation from the branching point instead of position zero.
The practical impact is dramatic for agent workloads. Every agent turn arrives carrying a long, mostly-static context: tool definitions, memory state, prior conversation turns. A standard inference server treats each request as independent and recomputes attention from scratch. SGLang’s radix tree walks the token sequence, finds the longest matching prefix, and skips recomputation for everything before the branch point.
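The longest-prefix lookup described above can be sketched with a plain token trie. This is a simplified illustration, not SGLang’s implementation: a real radix tree merges single-child chains into one edge and attaches KV tensors and eviction metadata to nodes, while this version keeps one token per node and tracks only which sequences have been seen. The class and method names are made up for the example.

```python
class PrefixCache:
    """Token-level prefix tree (uncompressed trie, for illustration only)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence as cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
system_and_tools = list(range(100))          # stand-in for a static 100-token context
turn1 = system_and_tools + [1000, 1001]
cache.insert(turn1)

turn2 = turn1 + [2000, 2001, 2002]           # next turn extends the previous one
other_session = system_and_tools + [3000]    # different session, same system prompt

print(cache.match_len(turn2))          # 102: only the 3 new tokens need prefill
print(cache.match_len(other_session))  # 100: branches right after the shared context
```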
Workloads where agents share a fixed system prompt and tool definitions across sessions see 75-95% cache hit rates on multi-turn conversations, according to Spheron’s deployment guide (Spheron SGLang Production Guide). That’s not a 5% optimization. If the hit rate is counted over prompt tokens, only 5-25% of them still need prefill, which is roughly a 4x to 20x reduction in prefill work for the dominant cost component of agentic inference. This is the same cache regeneration problem that GPU schedulers waste 38% of their time on.
vLLM does support prefix caching via its Automatic Prefix Caching (APC) feature, enabled with --enable-prefix-caching. But APC operates at the block level with hash-based matching, while RadixAttention’s token-level radix tree achieves finer-grained reuse — particularly on multi-turn conversations where each turn extends the previous context by a few hundred tokens.
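To make the block-level versus token-level distinction concrete, here is a simplified model of hash-based block matching. It is not vLLM’s code: the block size of 16, the SHA-256 chaining, and the function names are assumptions chosen for illustration. Each full block is hashed together with its parent’s hash, so a block only matches when the entire prefix before it matches too.

```python
import hashlib

BLOCK_SIZE = 16  # assumed block size for this sketch


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hashes of each *full* block; a trailing partial block is skipped."""
    hashes = []
    parent = b""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def reusable_tokens(cached_tokens, new_tokens, block_size=BLOCK_SIZE):
    """Tokens of `new_tokens` covered by leading blocks already in the cache."""
    cached = set(block_hashes(cached_tokens, block_size))
    reused = 0
    for digest in block_hashes(new_tokens, block_size):
        if digest not in cached:
            break
        reused += block_size
    return reused


previous_turn = list(range(1000))                           # 1,000 cached tokens
next_turn = previous_turn + [5000 + i for i in range(300)]  # extends by 300 tokens

print(reusable_tokens(previous_turn, next_turn))  # 992: 62 full blocks of 16
```

In this model a token-level match would reuse all 1,000 shared tokens, while block-level matching reuses 992 and recomputes the partial last block. How much that difference matters in practice also depends on eviction and scheduling behavior, which the sketch does not model.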
Structured Generation: SGLang’s Hidden Edge
If you’re serving JSON-structured outputs — and in 2026, most production LLM workloads are — the structured generation path matters as much as raw throughput. SGLang’s compressed finite state machine (FSM) overlaps mask generation with the GPU inference step, avoiding the serialization stall that plagues other engines. SqueezeBits benchmarks found vLLM shows significant throughput degradation with guided decoding enabled, especially at batch size 8 and above (TECHSY/SqueezeBits analysis).
The numbers from Morph’s comparison are striking: SGLang’s compressed FSM reduces latency by up to 2x and boosts throughput by up to 2.5x compared to standard guided decoding approaches, with JSON schema compliance reaching 96-98% (Morph comparison). The original SGLang paper reported up to 6.4x higher throughput than baseline systems across a benchmark suite that includes JSON decoding (Zheng et al., 2024).
vLLM supports guided decoding via outlines and lm-format-enforcer integrations, but the overhead is visible under load. If your production stack is function-calling heavy — agent tool selection, structured data extraction, API response formatting — SGLang’s native structured generation is a meaningful architectural advantage, not a marginal benchmark win.
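A toy example helps show why a compressed FSM saves work during structured generation. The sketch below is character-level and covers only the two strings `{"ok": true}` and `{"ok": false}`; real engines work on tokenizer tokens and apply the allowed set as a logit mask. The "jump forward" step reflects one reading of the compressed FSM idea from the SGLang paper: wherever the grammar allows exactly one continuation, emit it without asking the model. All function names here are hypothetical.

```python
def build_transitions():
    """DFA for {"ok": true} and {"ok": false}, one character per edge."""
    trans = {}
    next_id = [1]  # state 0 is the start state

    def add_path(start, text):
        state = start
        for ch in text:
            edges = trans.setdefault(state, {})
            if ch not in edges:
                edges[ch] = next_id[0]
                next_id[0] += 1
            state = edges[ch]
        return state

    branch = add_path(0, '{"ok": ')
    add_path(branch, "true}")
    add_path(branch, "false}")
    return trans


def jump_forward(trans, state):
    """Follow edges while exactly one symbol is allowed; no model call needed."""
    forced = []
    while len(trans.get(state, {})) == 1:
        ch, state = next(iter(trans[state].items()))
        forced.append(ch)
    return "".join(forced), state


def constrained_decode(trans, pick):
    """`pick(allowed)` stands in for the model choosing among masked symbols."""
    state, out, model_calls = 0, [], 0
    while state in trans:
        forced, state = jump_forward(trans, state)
        out.append(forced)
        if state not in trans:  # reached an accepting state
            break
        choice = pick(set(trans[state]))
        model_calls += 1
        out.append(choice)
        state = trans[state][choice]
    return "".join(out), model_calls


text, calls = constrained_decode(build_transitions(), lambda allowed: sorted(allowed)[0])
print(text, calls)  # {"ok": false} 1
```

Without the jump-forward step, the same output would take one masked model step per character, 13 in total. With it, the only real decision is `t` versus `f`.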
Cold Starts and Operational Reality
Throughput and latency dominate benchmarks, but cold start time determines whether your autoscaling works. Spheron’s numbers paint a brutal picture: vLLM cold starts in approximately 62 seconds, SGLang in 58 seconds, but TensorRT-LLM takes about 28 minutes due to its compilation pipeline (Spheron benchmarks).
For serverless GPU deployments with scale-to-zero, that 28-minute TensorRT-LLM cold start is a non-starter without pre-warmed pools. vLLM and SGLang both cold-start in about a minute (62 and 58 seconds respectively), which is workable for Kubernetes-based autoscaling with a small buffer of warm replicas, as we explored in our analysis of serverless GPU cold start mitigation. The LeetLLM analysis frames this well: “the engine matters most when you need fast starts and model flexibility” (LeetLLM 2026 guide).
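A rough way to size that warm buffer: while a new replica boots, demand keeps growing, and the warm replicas have to absorb the difference. The sketch below is a back-of-the-envelope rule, not a production autoscaling policy. The ramp rate of 50 tok/s per second is a hypothetical input, the per-replica throughputs are the concurrency-50 figures quoted earlier, and the TensorRT-LLM row reuses vLLM’s throughput purely as a stand-in, because no per-replica figure is quoted for it here.

```python
import math


def warm_replicas_needed(cold_start_s, ramp_tps_per_s, replica_tps):
    """Replicas of headroom needed to absorb load growth during one cold start."""
    extra_load_tps = ramp_tps_per_s * cold_start_s
    return math.ceil(extra_load_tps / replica_tps)


RAMP = 50  # hypothetical: demand grows by 50 tok/s every second during a spike

print(warm_replicas_needed(62, RAMP, 1850))       # vLLM: 2
print(warm_replicas_needed(58, RAMP, 1920))       # SGLang: 2
print(warm_replicas_needed(28 * 60, RAMP, 1850))  # TensorRT-LLM (stand-in throughput): 46
```

Under these assumptions, the one-minute engines need a couple of warm replicas, while a 28-minute cold start needs a pool large enough that scale-to-zero stops making sense.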
Hardware support also diverges. vLLM runs on NVIDIA, AMD, Intel, AWS Trainium, and TPU. SGLang supports NVIDIA and AMD. If your infrastructure spans cloud providers or uses non-NVIDIA accelerators, vLLM’s broader backend support removes a hard constraint.
The Decision Matrix
Neither engine is universally better. The right choice depends entirely on workload shape:
- Choose vLLM if your workload is high-concurrency, stateless, chat-based, or needs the broadest hardware support. It’s the safer default, with the largest community (17k+ GitHub stars), mature Helm charts, and battle-tested production deployments across AWS, GCP, and Azure.
- Choose SGLang if your workload is prefix-heavy — multi-turn agents, RAG pipelines with shared system prompts, structured JSON output at scale, or multi-LoRA serving. The RadixAttention cache hit rates and compressed FSM for structured generation deliver gains that raw throughput benchmarks don’t capture. This matters most when agentic workflows already cost 5x more than budgeted.
- Choose TensorRT-LLM only if you serve a single stable model on NVIDIA-only infrastructure for months at a time, and maximum throughput per GPU justifies the 28-minute compilation cost and operational complexity.
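The three rules above can be written down as a small helper, which makes the precedence between them explicit. This is one possible encoding, not an official selection tool: the field names are hypothetical, the 0.7 overlap threshold is a heuristic cut-off, and the order of the checks is a judgment call.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    prefix_overlap: float = 0.0             # share of prompt tokens reused across requests
    structured_output_heavy: bool = False   # JSON / function calling dominates traffic
    needs_other_accelerators: bool = False  # Intel, Trainium, TPU, ...
    single_stable_model: bool = False       # same model served unchanged for months
    nvidia_only: bool = True
    compile_cost_acceptable: bool = False   # long engine builds are tolerable


def choose_engine(w: Workload) -> str:
    # Hard constraint first: only vLLM covers the non-NVIDIA/AMD backends listed above.
    if w.needs_other_accelerators:
        return "vLLM"
    # TensorRT-LLM only for a fixed model on NVIDIA where the build cost is accepted.
    if w.single_stable_model and w.nvidia_only and w.compile_cost_acceptable:
        return "TensorRT-LLM"
    # Prefix-heavy or structured-output-heavy traffic favors SGLang.
    if w.prefix_overlap >= 0.7 or w.structured_output_heavy:
        return "SGLang"
    return "vLLM"  # safer default for stateless, high-concurrency chat


print(choose_engine(Workload(prefix_overlap=0.85, structured_output_heavy=True)))  # SGLang
print(choose_engine(Workload(prefix_overlap=0.10)))                                # vLLM
```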
The Yotta Labs analysis gets the framing right: “There is no universal winner between vLLM and SGLang. vLLM is optimized for efficiency and throughput at scale” while SGLang wins on “orchestration, not hardware” (Yotta Labs 2026).
What Actually Matters in 2026
The benchmark gap between vLLM and SGLang on 70B models is only a few percent, worth counting at fleet scale but too small to drive the choice on its own. The real engineering decisions live in three places: your prefix overlap ratio, your structured output requirements, and your operational constraints. Measure your actual workload’s cache hit rate before picking an engine. If 70%+ of your tokens are shared prefixes (and with agent workloads, they almost always are), RadixAttention’s 75-95% hit rate makes SGLang the clear winner regardless of what raw throughput charts say.
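One way to estimate the prefix overlap ratio offline is to replay logged prompts, already tokenized, through a trie and count how many tokens of each request match a prefix seen earlier. The sketch below assumes an unbounded cache with no eviction, so it gives an upper bound on the hit rate a real server would achieve. The function name and the synthetic log are illustrative.

```python
def prefix_overlap_ratio(requests):
    """Shared-prefix tokens divided by total prompt tokens.

    `requests` is a list of token-id lists in arrival order. Assumes an
    unbounded cache, so the result is an upper bound on the real hit rate.
    """
    root = {}
    shared = total = 0
    for tokens in requests:
        node = root
        still_matching = True
        for tok in tokens:
            if still_matching and tok in node:
                shared += 1
            else:
                still_matching = False
            node = node.setdefault(tok, {})  # record this request for later ones
        total += len(tokens)
    return shared / total if total else 0.0


# Synthetic log: three requests, each 90 shared context tokens plus 10 unique ones.
context = list(range(90))
log = [context + [1000 * k + j for j in range(10)] for k in range(1, 4)]

print(prefix_overlap_ratio(log))  # 0.6: the first request misses, the next two reuse 90 each
```

Running this over a day of production prompts, tokenized with the serving model’s tokenizer, gives a number to compare against the 70% threshold.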
vLLM’s PagedAttention remains the more mature, broadly compatible choice for general-purpose serving. But as LLM workloads shift from stateless chat to multi-step agent pipelines with structured outputs, SGLang’s architecture is better aligned with where production traffic is actually going.
References
- Spheron — vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)
- TECHSY — vLLM vs SGLang 2026: H100 Benchmarks Inside
- Yotta Labs — vLLM vs SGLang in 2026: Speed, Throughput, and Cost
- Zheng et al. — SGLang: Efficient Execution of Structured Language Model Programs (arXiv)
- Spheron — SGLang Production Deployment Guide (2026)
- Morph — vLLM vs SGLang 2026: Benchmarks and Architecture
- LeetLLM — Choosing an Inference Engine in 2026