Choosing between vLLM, TensorRT-LLM, and SGLang in 2026 comes down to three questions: how many models you serve, how fast you need to go live, and whether your workload shares prefixes. Benchmarks on H100 80GB with Llama 3.3 70B at FP8 show TensorRT-LLM delivering 13% higher throughput than vLLM at 50 concurrent requests, but requiring a 28-minute compilation step that vLLM skips entirely. The right choice depends entirely on your deployment constraints.
Why Your Engine Matters
Once a model leaves training and enters production, the inference engine becomes the single biggest determinant of latency, throughput, and GPU cost. The same model served on different engines can deliver dramatically different performance characteristics — not because one engine is universally superior, but because each optimizes for a different bottleneck in the inference pipeline.
vLLM optimizes for concurrency and memory efficiency via PagedAttention, which stores the KV cache in fixed-size blocks much as an operating system pages virtual memory. TensorRT-LLM optimizes for raw hardware utilization through compiled CUDA kernel graphs tailored to specific GPU configurations. SGLang optimizes for shared-prefix workloads using RadixAttention, which keeps KV cache entries in a radix tree keyed by token sequence so that requests with a common prefix can reuse them.
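To make the prefix-sharing idea concrete, here is a minimal Python sketch of a token-level prefix tree. It is a toy model and not SGLang's implementation: a real radix tree compresses runs of tokens into single edges and stores references to KV blocks. The names `PrefixCache` and `Node` and the token IDs are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One token in the prefix tree; its presence means KV state is cached."""
    children: dict[int, "Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix tree (a real radix tree compresses token runs)."""

    def __init__(self) -> None:
        self.root = Node()

    def match(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached state."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record that cached state now exists for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


cache = PrefixCache()
system_prompt = [1, 15, 27, 99]        # hypothetical token IDs
cache.insert(system_prompt + [7, 8])   # first request populates the cache
reused = cache.match(system_prompt + [42, 43])
print(f"{reused} of 6 tokens reused")  # only the unmatched suffix needs prefill
```

The second request shares the four-token system prompt with the first, so only its two new tokens would need a prefill pass. With fully unique prompts, `match` returns 0 and the cache gives no benefit.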
All three were tested under controlled conditions on a single H100 SXM5 80GB instance with Llama 3.3 70B Instruct at FP8 precision. Framework versions: vLLM v0.18.0, TensorRT-LLM v1.2.0, SGLang v0.5.9. Each benchmark ran 200 diverse prompts (512 input tokens, 256 output tokens average) across four concurrency levels after a 60-second warm-up period.
Throughput Comparison
Output tokens per second is the metric that determines how many users your GPU can serve simultaneously. Here is how the three engines compare across concurrency levels:
| Concurrency | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| 1 req | 120 tok/s | 130 tok/s | 125 tok/s |
| 10 req | 650 tok/s | 710 tok/s | 680 tok/s |
| 50 req | 1,850 tok/s | 2,100 tok/s | 1,920 tok/s |
| 100 req | 2,400 tok/s | 2,780 tok/s | 2,460 tok/s |
TensorRT-LLM leads at every concurrency level, and its lead over vLLM grows with load: about 8% at 1 request, 9% at 10, 13% at 50, and 16% at 100 concurrent requests. SGLang sits between vLLM and TensorRT-LLM at every level when prompts are unique; its RadixAttention advantage only materializes when requests share prefixes.
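The relative gaps can be recomputed directly from the throughput table. The short Python sketch below does that arithmetic; the function name `lead_over_vllm` is illustrative, and the numbers are copied from the table rather than measured again.

```python
# Output tokens per second, copied from the throughput table.
throughput = {
    1: {"vLLM": 120, "TensorRT-LLM": 130, "SGLang": 125},
    10: {"vLLM": 650, "TensorRT-LLM": 710, "SGLang": 680},
    50: {"vLLM": 1850, "TensorRT-LLM": 2100, "SGLang": 1920},
    100: {"vLLM": 2400, "TensorRT-LLM": 2780, "SGLang": 2460},
}


def lead_over_vllm(engine: str) -> dict[int, float]:
    """Percentage throughput lead of `engine` over vLLM at each concurrency."""
    return {
        concurrency: round(100 * (row[engine] / row["vLLM"] - 1), 1)
        for concurrency, row in throughput.items()
    }


gaps = lead_over_vllm("TensorRT-LLM")
for concurrency, pct in gaps.items():
    print(f"{concurrency:>3} req: +{pct}% vs vLLM")
```

Swapping in `"SGLang"` shows the same comparison for the third column.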
Latency: Time to First Token
TTFT determines whether your application feels instant or sluggish. The p95 numbers are what matter for production SLAs:
| Concurrency | vLLM p95 | TRT-LLM p95 | SGLang p95 |
|---|---|---|---|
| 10 req | 195 ms | 170 ms | 178 ms |
| 50 req | 720 ms | 620 ms | 680 ms |
| 100 req | 1,450 ms | 1,280 ms | 1,380 ms |
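At 100 concurrent requests, TensorRT-LLM’s p95 TTFT is 170 ms lower than vLLM’s, a gap that directly affects perceived responsiveness in interactive applications. Note that every engine exceeds 1 second at this load. For chatbot deployments where users abandon after 1 second of delay, that points to running closer to 50 concurrent requests per GPU, where all three engines stay at or below 720 ms.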
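Since the table reports p95 rather than the mean, it helps to see how a tail percentile is computed from raw TTFT samples. Below is a small Python sketch using the nearest-rank method; the function name `percentile` is illustrative, and the sample values are hypothetical rather than taken from the benchmark.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the
    data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds (not benchmark data).
ttft_ms = [150, 160, 155, 170, 165, 158, 162, 168, 152, 900]

p50 = percentile(ttft_ms, 50)
p95 = percentile(ttft_ms, 95)
print(f"p50={p50} ms, p95={p95} ms")
```

One slow request out of ten barely moves the median but dominates the p95, which is why SLAs are usually written against tail latency.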
Cold Start and Deployment
Cold start time determines whether your engine supports auto-scaling from zero, blue-green deployments, and rapid model updates:
| Engine | Time to First Request |
|---|---|
| vLLM | ~62 seconds |
| TensorRT-LLM (compiled) | ~28 minutes |
| TensorRT-LLM (PyTorch backend) | ~60-90 seconds |
| SGLang | ~58 seconds |
TensorRT-LLM’s 28-minute compilation is not a flaw but a deliberate tradeoff. The build runs once per model version and saves a compiled engine binary to disk; subsequent restarts reload the cached engine in about 90 seconds. The newer PyTorch backend (stable since v1.0) loads Hugging Face weights directly, cutting cold start to 60-90 seconds at the cost of lower peak throughput.
VRAM Footprint Analysis
All three engines land within 1 GB of each other at peak VRAM, and within 3 GB at idle, with a 70B FP8 model on an 80 GB H100. This means VRAM is rarely the deciding factor:
| Engine | Idle | Peak (100 req) |
|---|---|---|
| vLLM | 71 GB | 78 GB |
| TensorRT-LLM | 74 GB | 79 GB |
| SGLang | 72 GB | 78 GB |
TensorRT-LLM’s compiled engine takes slightly more idle VRAM because it stores additional activation buffers. vLLM and SGLang tie for the lowest peak usage at 78 GB, 1 GB below TensorRT-LLM. When VRAM is your bottleneck, engine choice matters less than your max-model-len and gpu-memory-utilization settings.
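The reason max-model-len matters so much is the per-sequence KV cache. As a rough sketch, the Python below estimates KV cache size from the published Llama 3 70B architecture (80 layers, 8 KV heads, head dimension 128). It assumes a dense cache with no prefix sharing and ignores engine overhead, so treat the result as an estimate for one full-length sequence, not as what any engine actually allocates. The function names are illustrative.

```python
# Llama 3 70B attention layout: 80 layers, 8 KV heads (GQA), head dim 128.
LLAMA_70B = {"layers": 80, "kv_heads": 8, "head_dim": 128}


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_gib_per_sequence(max_model_len: int, bytes_per_elem: int) -> float:
    """GiB of KV cache one sequence needs if it fills the whole context."""
    per_token = kv_bytes_per_token(**LLAMA_70B, bytes_per_elem=bytes_per_elem)
    return max_model_len * per_token / 1024**3


for max_len in (4096, 8192, 32768):
    fp16 = kv_gib_per_sequence(max_len, bytes_per_elem=2)
    fp8 = kv_gib_per_sequence(max_len, bytes_per_elem=1)
    print(f"max-model-len {max_len:>5}: {fp16:.2f} GiB (16-bit KV), "
          f"{fp8:.2f} GiB (8-bit KV)")
```

Under these assumptions a single full 8192-token sequence needs about 2.5 GiB of 16-bit KV cache, so halving max-model-len or switching to an 8-bit KV cache frees far more memory than the differences between engines.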
Quick Start Commands
vLLM (FP8 with a single flag, OpenAI-compatible API):

    vllm serve meta-llama/Llama-3.3-70B-Instruct \
      --quantization fp8 \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.95

TensorRT-LLM (two-step: quantize, then compile):

    python3 quantize.py --model_dir ./llama-70b \
      --qformat fp8 --output_dir ./quantized
    trtllm-build --checkpoint_dir ./quantized \
      --output_dir ./engine --use_fp8

SGLang (RadixAttention enabled by default):

    python3 -m sglang.launch_server \
      --model-path meta-llama/Llama-3.3-70B-Instruct \
      --quantization fp8 --port 30000

Production Monitoring and Cost Optimization
Choosing the right engine is only half the battle. Sustaining performance and controlling GPU spend in production requires observability and cost-aware architecture. Below are the considerations that separate a functional deployment from an efficient one.
Cost per 1M Output Tokens
On-demand H100 80GB pricing varies by cloud provider, but the relative cost structure stays consistent. Using $3.50/hr as a reference rate, here is the estimated cost per million output tokens at typical concurrency levels:
| Engine | Tok/s at 50 req | Cost/1M Tokens | Monthly (10M tok/day) |
|---|---|---|---|
| vLLM | 1,850 | $0.53 | $158 |
| TensorRT-LLM | 2,100 | $0.46 | $139 |
| SGLang | 1,920 | $0.51 | $152 |
TensorRT-LLM’s throughput advantage translates to roughly 12% lower token cost ($0.46 versus $0.53 per million tokens). However, if your workload is bursty and you rely on auto-scaling, any instance that has to build the engine pays the 28-minute compile time, so you need either a warm standby instance, which effectively doubles your baseline cost, or a pre-built engine cached where new instances can load it. vLLM and SGLang can scale from zero to serving in under 90 seconds, making them far more suitable for workloads with significant idle periods.
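The cost column follows from a single division. The Python sketch below reproduces it from the $3.50/hr reference rate and the 50-request throughput figures; the function name `cost_per_million_tokens` is illustrative, and the calculation assumes the GPU is kept busy at that throughput for the whole billed hour.

```python
HOURLY_RATE_USD = 3.50  # reference on-demand H100 rate used in the table


def cost_per_million_tokens(tokens_per_second: float,
                            hourly_rate: float = HOURLY_RATE_USD) -> float:
    """USD per 1M output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


# Throughput at 50 concurrent requests, from the table.
engines = {"vLLM": 1850, "TensorRT-LLM": 2100, "SGLang": 1920}

for name, tps in engines.items():
    per_million = cost_per_million_tokens(tps)
    monthly = per_million * 10 * 30  # 10M tokens/day over a 30-day month
    print(f"{name}: ${per_million:.2f} per 1M tokens, ${monthly:.0f} per month")

# Relative saving of TensorRT-LLM over vLLM on per-token cost.
saving = 1 - cost_per_million_tokens(2100) / cost_per_million_tokens(1850)
print(f"TensorRT-LLM saving vs vLLM: {saving:.1%}")
```

Because cost is inversely proportional to throughput, a 13.5% throughput lead yields a smaller percentage saving on cost. Idle time is not modeled here, and it can easily outweigh the per-token difference on bursty workloads.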
Essential Metrics to Track
Regardless of your engine choice, track these four metrics in your observability stack. All three engines can expose Prometheus-compatible endpoints; the vLLM endpoint and metric names are shown here:
    # vLLM Prometheus endpoint
    curl -s localhost:8000/metrics | grep vllm

    # Key metrics to alert on:
    # vllm:num_requests_running       active requests (queue depth signal)
    # vllm:avg_generation_throughput  tokens/sec over 1-min window
    # vllm:gpu_cache_usage_perc       KV cache utilization (scale trigger)
    # vllm:num_requests_waiting       queued requests (latency predictor)

Set up alerts on two thresholds: num_requests_waiting > 20 signals you need to scale out within minutes, and gpu_cache_usage_perc > 90% warns of imminent OOM errors before they cascade. Exact metric names change between releases, so check them against the /metrics output of the version you run. For SGLang, metrics are switched on with --enable-metrics and use the sglang: prefix; for TensorRT-LLM served through Triton, look under nv_trt_llm_.
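The two alert thresholds can be sketched as a small check over the scraped metrics text. The Python below is a minimal illustration, not a replacement for Prometheus alerting rules. It assumes the vLLM metric names listed above and that the cache gauge is reported as a fraction between 0 and 1; `parse_metrics`, `check_alerts`, and the sample payload are hypothetical.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}.

    Simplified: labels are dropped (the last series per name wins) and
    lines are assumed to carry no trailing timestamp.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        values[name] = float(value)
    return values


def check_alerts(metrics: dict[str, float], max_waiting: int = 20,
                 max_cache_usage: float = 0.90) -> list[str]:
    """Return a message for each threshold that is currently exceeded."""
    alerts = []
    if metrics.get("vllm:num_requests_waiting", 0) > max_waiting:
        alerts.append("scale out: request queue is growing")
    if metrics.get("vllm:gpu_cache_usage_perc", 0) > max_cache_usage:
        alerts.append("KV cache above threshold")
    return alerts


# Hypothetical scrape output, shaped like a /metrics response.
sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama"} 27.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.93
"""

for message in check_alerts(parse_metrics(sample)):
    print(message)
```

In production the same thresholds would normally live in Prometheus alerting rules; a script like this is mainly useful for a quick check during load testing.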
Practical Deployment Tips
- GPU affinity matters. Pin your inference process to a specific GPU with `CUDA_VISIBLE_DEVICES=0` and disable CPU frequency scaling with the `performance` governor. Unpinned deployments can lose 5–8% throughput to context switching.
- Use continuous batching. All three engines enable it by default, but vLLM’s `--enable-chunked-prefill` and SGLang’s `--chunked-prefill-size 256` can reduce TTFT by 15–20% on long-context requests by interleaving prefill and decode phases.
- Reserve headroom. Set `gpu-memory-utilization` to 0.90 instead of 0.95. The 5% headroom prevents CUDA OOM crashes during traffic spikes and costs you less than 2% throughput.
- Cache compiled engines. For TensorRT-LLM, store your compiled engine in a persistent volume or object storage. Rebuilding from scratch on every deploy wastes 28 minutes and defeats the purpose of compilation.
- Load test before going live. Use `locust` or the built-in benchmarking tools to simulate your expected concurrency profile. Real-world traffic patterns, especially the ratio of short to long prompts, can shift the rankings between engines.
Decision Framework
- Many models, frequent updates → vLLM (no compile, widest support)
- One model, maximum throughput → TensorRT-LLM (compiled engine)
- Shared prefixes, chatbot/RAG → SGLang (RadixAttention caching)
- Auto-scaling from zero → vLLM or SGLang (under 90s cold start)
- Limited DevOps capacity → vLLM (simplest deployment)
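The rules above can be written down as a small function, which makes the precedence between them explicit. This is only a sketch: the ordering of the checks (operational constraints first, then workload shape, then raw throughput) is an assumption the list itself does not state, and `recommend_engine` is a hypothetical helper.

```python
def recommend_engine(*, many_models: bool = False,
                     limited_devops: bool = False,
                     shared_prefixes: bool = False,
                     scale_from_zero: bool = False) -> str:
    """Map deployment constraints to an engine, following the list above.

    Assumed precedence: operational constraints, then workload shape,
    then raw throughput as the default.
    """
    if many_models or limited_devops:
        return "vLLM"            # no compile step, simplest deployment
    if shared_prefixes:
        return "SGLang"          # RadixAttention prefix caching
    if scale_from_zero:
        return "vLLM or SGLang"  # cold start under 90 seconds
    return "TensorRT-LLM"        # one model, maximum throughput


print(recommend_engine(shared_prefixes=True))
print(recommend_engine(scale_from_zero=True))
print(recommend_engine())
```

Teams that weigh the constraints differently, for example putting throughput ahead of deployment simplicity, would reorder the checks.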
Key Takeaways
- Start with vLLM — widest model support, no compilation, 62-second cold start
- TensorRT-LLM wins on throughput — 13% faster at 50 concurrent, but 28-min compile
- SGLang for shared prefixes — RadixAttention cuts TTFT when requests share context
- PyTorch backend — TensorRT-LLM v1.0+ alternative skips compilation at cost of peak throughput
- VRAM difference is minimal — under 4 GB across all three engines