vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks 2026

Choosing between vLLM, TensorRT-LLM, and SGLang in 2026 comes down to three questions: how many models you serve, how fast you need to go live, and whether your workload shares prefixes. Benchmarks on H100 80GB with Llama 3.3 70B at FP8 show TensorRT-LLM delivering 13% higher throughput than vLLM at 50 concurrent requests, but requiring a 28-minute compilation step that vLLM skips entirely. The right choice depends entirely on your deployment constraints.

Why Your Engine Matters

Once a model leaves training and enters production, the inference engine becomes the single biggest determinant of latency, throughput, and GPU cost. The same model served on different engines can deliver dramatically different performance characteristics — not because one engine is universally superior, but because each optimizes for a different bottleneck in the inference pipeline.

vLLM optimizes for concurrency and memory efficiency via PagedAttention, a system that treats GPU memory like virtual memory pages. TensorRT-LLM optimizes for raw hardware utilization through compiled CUDA kernel graphs tailored to specific GPU configurations. SGLang optimizes for shared-prefix workloads using RadixAttention, which caches attention activations in a radix tree keyed by token sequence.

All three were tested under controlled conditions on a single H100 SXM5 80GB instance with Llama 3.3 70B Instruct at FP8 precision. Framework versions: vLLM v0.18.0, TensorRT-LLM v1.2.0, SGLang v0.5.9. Each benchmark ran 200 diverse prompts (512 input tokens, 256 output tokens average) across four concurrency levels after a 60-second warm-up period.

Throughput Comparison

Output tokens per second is the metric that determines how many users your GPU can serve simultaneously. Here is how the three engines compare across concurrency levels:

ConcurrencyvLLMTensorRT-LLMSGLang
1 req120 tok/s130 tok/s125 tok/s
10 req650 tok/s710 tok/s680 tok/s
50 req1,850 tok/s2,100 tok/s1,920 tok/s
100 req2,400 tok/s2,780 tok/s2,460 tok/s

TensorRT-LLM leads at every concurrency level. The gap is smallest at low concurrency (8% faster than vLLM at 1 request) and widest at 50 concurrent requests (13% faster). SGLang sits between vLLM and TensorRT-LLM at high concurrency when prompts are unique — its RadixAttention advantage only materializes when requests share prefixes.

Latency: Time to First Token

TTFT determines whether your application feels instant or sluggish. The p95 numbers are what matter for production SLAs:

ConcurrencyvLLM p95TRT-LLM p95SGLang p95
10 req195 ms170 ms178 ms
50 req720 ms620 ms680 ms
100 req1,450 ms1,280 ms1,380 ms

At 100 concurrent requests, the 170 ms gap between TensorRT-LLM and vLLM on p95 TTFT directly affects perceived responsiveness in interactive applications. For chatbot deployments where users abandon after 1 second of delay, this difference is material.

Cold Start and Deployment

Cold start time determines whether your engine supports auto-scaling from zero, blue-green deployments, and rapid model updates:

EngineTime to First Request
vLLM~62 seconds
TensorRT-LLM (compiled)~28 minutes
TensorRT-LLM (PyTorch backend)~60-90 seconds
SGLang~58 seconds

TensorRT-LLM’s 28-minute compilation is not a flaw but a deliberate tradeoff. The build runs once per model version, saves a compiled engine binary to disk, and subsequent restarts reload the cached engine in about 90 seconds. The new PyTorch backend (stable since v1.0) loads HuggingFace weights directly, cutting cold start to 60-90 seconds at the cost of lower peak throughput.

VRAM Footprint Analysis

All three engines operate within 4 GB of each other on peak VRAM with a 70B FP8 model on 80 GB H100. This means VRAM is rarely the deciding factor:

EngineIdlePeak (100 req)
vLLM71 GB78 GB
TensorRT-LLM74 GB79 GB
SGLang72 GB78 GB

TensorRT-LLM’s compiled engine takes slightly more idle VRAM because it stores additional activation buffers. SGLang uses the least VRAM at peak load due to its efficient KV cache management. When VRAM is your bottleneck, engine choice matters less than your max-model-len and gpu-memory-utilization settings.

Quick Start Commands

vLLM — FP8 with a single flag, OpenAI-compatible API:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95

TensorRT-LLM — Two-step: quantize then compile:

python3 quantize.py --model_dir ./llama-70b \
  --qformat fp8 --output_dir ./quantized

trtllm-build --checkpoint_dir ./quantized \
  --output_dir ./engine --use_fp8

SGLang — RadixAttention enabled by default:

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 --port 30000

Production Monitoring and Cost Optimization

Choosing the right engine is only half the battle. Sustaining performance and controlling GPU spend in production requires observability and cost-aware architecture. Below are the considerations that separate a functional deployment from an efficient one.

Cost per 1M Output Tokens

On-demand H100 80GB pricing varies by cloud provider, but the relative cost structure stays consistent. Using $3.50/hr as a reference rate, here is the estimated cost per million output tokens at typical concurrency levels:

EngineTok/s at 50 reqCost/1M TokensMonthly (10M tok/day)
vLLM1,850$0.53$159
TensorRT-LLM2,100$0.46$139
SGLang1,920$0.51$153

TensorRT-LLM’s throughput advantage translates to roughly 13% lower token cost. However, if your workload is bursty and you rely on auto-scaling, the 28-minute compile time of TensorRT-LLM means you need a warm standby instance — effectively doubling your baseline cost. vLLM and SGLang can scale from zero to serving in under 90 seconds, making them far more suitable for workloads with significant idle periods.

Essential Metrics to Track

Regardless of your engine choice, track these four metrics in your observability stack. All three engines expose Prometheus-compatible endpoints by default:

# vLLM Prometheus endpoint
curl localhost:8000/metrics | grep vllm

# Key metrics to alert on:
#   vllm:num_requests_running    — active requests (queue depth signal)
#   vllm:avg_generation_throughput — tokens/sec over 1-min window
#   vllm:gpu_cache_usage_perc    — KV cache utilization (scale trigger)
#   vllm:num_requests_waiting    — queued requests (latency predictor)

Set up alerts on two thresholds: num_requests_waiting > 20 signals you need to scale out within minutes, and gpu_cache_usage_perc > 90% warns of imminent OOM errors before they cascade. For SGLang, the equivalent metric prefix is sglang_ and for TensorRT-LLM, look under nvidia_trt_llm_.

Practical Deployment Tips

  • GPU affinity matters. Pin your inference process to a specific GPU with CUDA_VISIBLE_DEVICES=0 and disable CPU frequency scaling with performance governor. Unpinned deployments can lose 5–8% throughput to context switching.
  • Use continuous batching. All three engines enable it by default, but vLLM’s --enable-chunked-prefill and SGLang’s --chunked-prefill-size 256 can reduce TTFT by 15–20% on long-context requests by interleaving prefill and decode phases.
  • Reserve headroom. Set gpu-memory-utilization to 0.90 instead of 0.95. The 5% headroom prevents CUDA OOM crashes during traffic spikes and costs you less than 2% throughput.
  • Cache compiled engines. For TensorRT-LLM, store your compiled engine in a persistent volume or object storage. Rebuilding from scratch on every deploy wastes 28 minutes and defeats the purpose of compilation.
  • Load test before going live. Use locust or the built-in benchmarking tools to simulate your expected concurrency profile. Real-world traffic patterns — especially the ratio of short vs. long prompts — can shift the rankings between engines.

Decision Framework

  1. Many models, frequent updates → vLLM (no compile, widest support)
  2. One model, maximum throughput → TensorRT-LLM (compiled engine)
  3. Shared prefixes, chatbot/RAG → SGLang (RadixAttention caching)
  4. Auto-scaling from zero → vLLM or SGLang (under 90s cold start)
  5. Limited DevOps capacity → vLLM (simplest deployment)

Key Takeaways

  • Start with vLLM — widest model support, no compilation, 62-second cold start
  • TensorRT-LLM wins on throughput — 13% faster at 50 concurrent, but 28-min compile
  • SGLang for shared prefixes — RadixAttention cuts TTFT when requests share context
  • PyTorch backend — TensorRT-LLM v1.0+ alternative skips compilation at cost of peak throughput
  • VRAM difference is minimal — under 4 GB across all three engines

Sources

Tags: 205920592059