Speculative Decoding Cuts MoE Inference Cost by 19%

19% Cheaper Inference, Zero Accuracy Loss

Red Hat benchmarked gpt-oss-120B with Eagle3 speculative decoding on vLLM v0.13.0 and measured a 19.4% reduction in cost per 1M output tokens on H200 GPUs running SWE-bench workloads. Not a benchmark trick — production throughput at 200 concurrent requests, with output distributions mathematically identical to baseline. The technique is called speculative decoding, and if you serve LLMs at scale, you need to understand why the old assumption — “it only helps at low concurrency” — is wrong for mixture-of-experts models (Red Hat, 2026).

Why Autoregressive Decoding Wastes Your GPU

Every token an LLM generates requires a full forward pass. The GPU reloads weights, reads the KV cache, computes attention — all for one token. On an H200 with 141 GB of VRAM, the compute units sit mostly idle during decode because the bottleneck is memory bandwidth, not FLOPs. NVIDIA’s own analysis confirms that autoregressive generation leaves hardware “drastically under-utilized” (NVIDIA, 2025). Google’s TPU v8 took a hardware approach to this same KV cache bottleneck by putting cache management directly on silicon (cloudai.pt); speculative decoding is the software-side answer.

This isn’t a marginal inefficiency. For decode-heavy workloads (writing assistants, code generation agents, summarization pipelines), the decode phase dominates total inference time. A model generating 500 output tokens from a 50-token prompt spends ~90% of its wall-clock time in the sequential decode loop, each step producing a single token while the GPU’s tensor cores wait for the next memory fetch.

Speculative Decoding in Practice

The idea is straightforward: a small, fast draft model proposes K next tokens in rapid succession, then the large target model verifies all K tokens in a single forward pass. Accepted tokens are committed. Rejected tokens are discarded and the target model’s own prediction at the rejection point becomes the new starting point.
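The draft-then-verify loop can be sketched in a few lines of Python. This is a simplified greedy variant, not vLLM's implementation: acceptance is an exact-match check against the target's own next token, and `toy_target`, `toy_draft`, and `greedy_baseline` are hypothetical stand-ins for real models. A production engine scores all K drafted positions in one batched forward pass, which the toy only counts rather than performs.

```python
def speculative_decode(target_next, draft_next, prompt, k, max_new_tokens):
    """Greedy draft-then-verify loop over lists of token ids (toy sketch)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. A real engine scores all
        #    positions in one batched forward pass; this toy calls target_next
        #    per position and counts the whole check as a single pass.
        target_passes += 1
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # 3. First mismatch: keep the accepted prefix, then continue
                #    from the target's own token at the rejection point.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            # All k drafts accepted; the same pass also yields one extra token.
            tokens.extend(draft)
            tokens.append(target_next(tokens))
    return tokens[:len(prompt) + max_new_tokens], target_passes


# Hypothetical toy models: the target counts upward mod 10, and the draft
# agrees except after tokens 2, 5 and 8, where it wrongly guesses 0.
def toy_target(seq):
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    return 0 if seq[-1] % 3 == 2 else (seq[-1] + 1) % 10


def greedy_baseline(target_next, prompt, max_new_tokens):
    """Plain autoregressive decoding: one target pass per token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


spec_tokens, passes = speculative_decode(toy_target, toy_draft, [0], k=3, max_new_tokens=12)
baseline_tokens = greedy_baseline(toy_target, [0], 12)
print(spec_tokens == baseline_tokens, passes)
```

In this toy run the speculative output matches the baseline token for token while using 4 target passes instead of 12, which is the whole point: the draft's mistakes cost extra draft work, never output quality.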

There are two main implementations:

  • Draft-target approach: A separately trained smaller model (e.g., Qwen3-1.7B drafting for Qwen3-32B). AWS benchmarked this on Trainium2 with vLLM and found up to 3x speedup on decode-heavy workloads (AWS, 2026). The draft and target models must share a tokenizer and vocabulary.
  • Eagle3 method: A lightweight autoregressive head attached to the target model’s internal layers. No separate draft model to manage — it reuses features already computed by the target. This is what Red Hat used in their gpt-oss-120B benchmarks.

The critical property: speculative decoding is mathematically lossless. Accepted tokens follow the exact target distribution. Evaluation benchmarks run identically. For regulated industries, this means no re-certification overhead. The technique builds on speculative sampling, first formalized by DeepMind and later extended to autoregressive transformers (UC Berkeley EECS, 2025).
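To see why sampling-based verification preserves the target distribution, here is a minimal sketch of the accept-or-resample rule from the speculative sampling literature for a single token position. The distributions `target` and `draft` are made-up three-token examples, and the function names are illustrative; this is not code from vLLM or from the cited benchmarks.

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is 0 only when p == q, in which case a rejection never happens.
    return [d / total for d in diff] if total > 0 else list(p)


def speculative_sample(p, q, rng=random):
    """Draw one token: propose from draft q, accept or resample against target p."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]        # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with probability min(1, p/q)
        return x
    return rng.choices(vocab, weights=residual(p, q))[0]


def exact_output_distribution(p, q):
    """Closed-form distribution of the token returned by speculative_sample."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accepted)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, res)]


target = [0.6, 0.3, 0.1]  # hypothetical target distribution p
draft = [0.3, 0.5, 0.2]   # hypothetical draft distribution q
print(exact_output_distribution(target, draft))
```

Up to floating-point rounding, the printed distribution equals `target`, even though the draft disagrees with it on every token. A poor draft lowers the acceptance rate, not the quality of the output.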

The Numbers That Matter

Red Hat’s benchmark across three real-world datasets — ShareGPT (conversational), MLPerf (enterprise summarization), and SWE-bench (code generation) — produced consistent results:

Metric                    ShareGPT Improvement  SWE-bench Improvement
Output throughput         +20.7%                +20.5%
Request latency (median)  -20.3%                -15.9%
ITL P95                   -12.4%                -17.5%
TTFT P95                  -10.8%                -0.3%

The SWE-bench TTFT result is instructive: code prompts are long and unique, so prefix cache hit rates are near zero and the prefill phase dominates. Speculative decoding doesn’t help prefill (it’s already parallelized). The gains come entirely from accelerating the decode phase, which is exactly where the money goes in production (Red Hat, 2026).

At peak utilization on an H200 priced at $41.62/hr (AWS on-demand equivalent), the cost per 1M output tokens drops from $4.41 to $3.56 on SWE-bench. That’s $0.85 saved per million tokens with zero changes to model weights or serving infrastructure beyond enabling the feature.
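The cost arithmetic is easy to reproduce for your own hardware. The sketch below assumes the usual definition of cost per 1M output tokens (hourly price divided by tokens generated per hour); the function name is illustrative, and the throughput you feed it should come from your own benchmark run.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    """Cost of generating 1M output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Figures quoted above for SWE-bench on an H200.
baseline_cost = 4.41      # USD per 1M output tokens without speculative decoding
speculative_cost = 3.56   # USD per 1M output tokens with Eagle3

saving = baseline_cost - speculative_cost
saving_pct = saving / baseline_cost * 100
print(f"saving: ${saving:.2f} per 1M tokens ({saving_pct:.1f}%)")
```

With the rounded prices quoted here, the saving works out to about 19.3%, in line with the reported 19.4% once rounding of the inputs is taken into account.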

High Concurrency Works on MoE

The conventional wisdom in the inference community has been that speculative decoding is a low-QPS optimization — it helps when you have few concurrent requests but the overhead of draft generation cancels out at high load. Red Hat’s data directly contradicts this for MoE architectures.

With gpt-oss-120B (a mixture-of-experts model), throughput gains persist at 200 concurrent requests. The geometric mean output throughput improvement holds at +20.7% across all six concurrency levels tested (1, 5, 25, 50, 100, 200). Why? MoE models activate only a subset of experts per token, leaving significant compute headroom that speculative decoding exploits during verification passes. Dense models may not show the same behavior — BentoML’s benchmarks suggest the speedup curve flattens earlier for dense architectures (BentoML, 2026).

Tensor parallelism doesn’t kill the gains either. At TP=2 on the MLPerf dataset, output throughput improves by 16% and median request latency drops 12.4%. The one caveat: TTFT P95 regresses 9.3% at TP=2 because the draft model’s additional compute during prefill competes with the target for GPU resources that prefill already keeps busy. If your workload is prefill-heavy (long prompts, short outputs), measure before enabling. For a broader view of how GPU scheduling is evolving to handle these tradeoffs, see our breakdown of how DRA replaced the Kubernetes GPU device plugin.

Draft Tokens: The Tuning Knob

The num_speculative_tokens parameter controls how many tokens the draft model proposes per step. More isn’t better. Red Hat tested 2, 3, and 4 draft tokens on ShareGPT:

Draft Tokens  Acceptance Rate  Mean Acceptance Length  Peak Throughput (tok/s)
2             45.4%            1.91                    2,480
3             35.6%            2.07                    2,574
4             28.3%            2.13                    2,240

The 4-draft configuration is worse on every metric that matters for serving: lower acceptance rate, lower throughput, higher ITL. Its mean acceptance length is only marginally higher than the 3-draft setting (2.13 vs. 2.07), so the additional draft token gets rejected more often than not and adds verification overhead without commensurate gain. The sweet spot for this model and workload is 2–3 draft tokens. AWS independently arrived at 7 speculative tokens for their Qwen3-1.7B/Qwen3 pairing on Trainium, which highlights that the optimal value depends on the draft-target agreement rate for your specific model pair and workload (AWS, 2026).
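The acceptance rate and mean acceptance length columns are linked by simple arithmetic. The sketch below assumes the common definition in which each verification step yields one token from the target plus however many drafts were accepted, so the mean length is roughly 1 + K × acceptance rate. The reported ShareGPT figures are consistent with that assumption, though the source does not state the formula.

```python
rows = [  # (draft tokens K, acceptance rate, reported mean acceptance length)
    (2, 0.454, 1.91),
    (3, 0.356, 2.07),
    (4, 0.283, 2.13),
]


def expected_acceptance_length(k, acceptance_rate):
    # One token always comes from the target, plus the accepted drafts.
    return 1 + k * acceptance_rate


for k, rate, reported in rows:
    estimate = expected_acceptance_length(k, rate)
    print(f"K={k}: estimated {estimate:.2f}, reported {reported:.2f}")
```

Going from 3 to 4 draft tokens adds only about 0.06 tokens per step on average, while every step still pays for drafting and verifying the fourth token.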

The tuning procedure is straightforward:

  1. Start with 3 speculative tokens.
  2. Run your production workload mix at expected concurrency levels.
  3. Extract acceptance rate from vLLM server logs.
  4. If acceptance rate > 50%, try increasing to 4–5 tokens.
  5. If acceptance rate < 30%, drop to 2 tokens or reconsider your draft model.
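The procedure above can be captured as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the accepted and drafted token counts are whatever your vLLM logs or metrics report (no log format is assumed here), and stepping up by one token at a time within the 4–5 range is one reading of "try increasing".

```python
def acceptance_rate(accepted_tokens, drafted_tokens):
    """Fraction of drafted tokens that the target accepted."""
    if drafted_tokens <= 0:
        raise ValueError("drafted_tokens must be positive")
    return accepted_tokens / drafted_tokens


def next_num_speculative_tokens(current, rate, high=0.50, low=0.30):
    """Suggest the next value of num_speculative_tokens to benchmark."""
    if rate > high:
        # Drafts are usually right: try a longer draft, within the 4-5 range.
        return min(max(current + 1, 4), 5)
    if rate < low:
        # Drafts are mostly wasted: shorten, or reconsider the draft model.
        return 2
    return current  # between the thresholds, keep the current setting


# Example with the ShareGPT acceptance rate reported for 3 draft tokens.
print(next_num_speculative_tokens(3, 0.356))
```

The suggestion is only the next configuration to measure; throughput and ITL at your production concurrency remain the deciding metrics.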

Where This Breaks Down

Speculative decoding is not a universal win. Three scenarios where you should think carefully:

Prefill-heavy workloads. If your requests have long prompts and short outputs (RAG retrieval, document Q&A), the decode phase is a small fraction of total latency. Speculative decoding does nothing for prefill, and the draft overhead can make TTFT worse — as Red Hat observed at TP=2.
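A quick Amdahl's-law estimate shows how little a prefill-heavy request can gain. The sketch assumes that speculative decoding speeds up only the decode share of request latency and leaves prefill untouched; the 1.2x decode speedup and the latency splits are illustrative values, not measurements from the cited benchmarks.

```python
def end_to_end_speedup(decode_fraction, decode_speedup):
    """Amdahl's law: only the decode share of latency gets faster."""
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    prefill_fraction = 1.0 - decode_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


# Illustrative latency splits: decode-heavy vs. prefill-heavy requests.
for decode_fraction in (0.9, 0.5, 0.1):
    speedup = end_to_end_speedup(decode_fraction, decode_speedup=1.2)
    print(f"decode share {decode_fraction:.0%}: {speedup:.3f}x end to end")
```

With decode at 90% of latency, a 1.2x decode speedup yields about 1.18x end to end; with decode at 10%, it yields under 1.02x, which any added draft overhead during prefill can easily erase.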

Poor draft-target alignment. AWS found that a Qwen3-0.6B draft model had ~60% lower acceptance rate than Qwen3-1.7B when drafting for the same target. The smaller model was faster per token but so many tokens got rejected that the net effect was negative. Models from different architectural families sharing only a tokenizer are risky — measure acceptance rates before committing to production.

Latency-sensitive streaming at low concurrency. At concurrency 1, the overhead of draft generation adds to p99 latency even if average throughput improves. If you’re serving a real-time chatbot with strict tail-latency SLAs and low QPS, the draft model’s compute may hurt more than it helps.

Enabling It in Production

In vLLM v0.13.0+, enabling Eagle3 speculative decoding requires minimal configuration changes:

  • Load the Eagle3 draft head alongside your target model (available on Hugging Face for popular model families).
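  • Pass the speculative settings to vLLM through --speculative-config as a JSON object that sets method to eagle3, model to the Eagle3 draft head, and num_speculative_tokens to 3, for example --speculative-config '{"method": "eagle3", "model": "<eagle3-draft-head>", "num_speculative_tokens": 3}'.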
  • Keep prefix caching enabled (it is on by default in vLLM); it interacts positively with speculative decoding by reducing redundant prefill work.
  • Benchmark with GuideLLM or LLMPerf at your production concurrency levels before flipping the switch.
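The configuration step can be scripted so the setting is easy to sweep during benchmarking. This sketch assumes the JSON-valued `--speculative-config` option documented for recent vLLM releases; confirm it against `vllm serve --help` for your installed version. The draft-head path is a placeholder, not a real repository name, and `build_vllm_serve_command` is a hypothetical helper.

```python
import json
import shlex


def build_vllm_serve_command(target_model, draft_head, num_speculative_tokens=3):
    """Assemble a `vllm serve` command line with Eagle3 speculative decoding."""
    speculative_config = {
        "method": "eagle3",
        "model": draft_head,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", target_model,
        "--speculative-config", json.dumps(speculative_config),
    ]


# Placeholder draft-head path: substitute the Eagle3 head for your model family.
cmd = build_vllm_serve_command("openai/gpt-oss-120b", "path/to/eagle3-draft-head")
print(shlex.join(cmd))
```

Generating the command from a function makes it straightforward to launch one server per candidate value of `num_speculative_tokens` and compare them under the same benchmark load.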

For teams running on managed infrastructure, Red Hat AI Inference Server ships speculative decoding support out of the box. AWS Trainium users get native support via NxD Inference with four modes including vanilla draft-target and EAGLE-based approaches. The tooling is mature enough that the barrier to entry is a benchmark run, not a research project. For context on where this fits in the broader inference infrastructure landscape, see our taxonomy of AI cloud categories platform teams need to know.

References