LLM Inference Nondeterminism: Why Temperature 0 Fails You

LLM inference nondeterminism means identical prompts can return different outputs even at temperature 0. Dynamic batching changes the order of floating-point reductions inside GPU kernels, so those kernels lack a property called batch invariance. Thinking Machines Lab measured 80 distinct completions from 1,000 identical requests on Qwen3-235B, and researchers documented up to 9% accuracy swings and 9,000-token length gaps on reasoning models just from changing GPU count or batch size. The cause is numerical, not random, and it is fixable with batch-invariant kernels now shipping in SGLang and in progress upstream in vLLM.

Why Temperature Zero Still Lies

Most engineers assume that setting temperature=0 makes an LLM deterministic. It does make sampling deterministic — greedy decoding always picks the argmax token. The problem is that the logits feeding that argmax are not stable from one request to the next, so two runs that should return identical token IDs quietly diverge. As Thinking Machines Lab documents, this holds for hosted APIs and for self-hosted stacks like vLLM and SGLang: identical hardware, identical weights, identical prompt, different bytes out.

The instinctive explanation is “GPUs are parallel, so addition order races.” That explanation is wrong in the specific way that matters. Horace He’s team demonstrates a trivial counterexample: run the same torch.mm on the same bfloat16 matrices 1,000 times and the output is bitwise-equal every iteration. Every kernel in a transformer forward pass is run-to-run deterministic in the same sense: same inputs, same shapes, same bits. Thread scheduling and atomic adds, the usual suspects, are not the actors here. The culprit is one layer up, in how the serving engine composes batches.

The Real Root Cause: Batch Invariance

A serving engine like vLLM or SGLang continuously rebatches incoming requests. Request A might be processed alone in one step and alongside requests B and C in the next. When the kernel reduces across a sequence, it tiles that reduction according to the current batch size. A different batch means a different tiling, which means the same floating-point numbers get summed in a different order. Because floating-point arithmetic is non-associative, (a + b) + c ≠ a + (b + c), the accumulated logits drift by tiny amounts, typically in the last few bits of bfloat16. Those bit-level differences stay invisible until two candidate tokens are nearly tied and the argmax breaks the other way, at which point the model picks a different token.
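The effect of tiling on a sum can be reproduced in a few lines of plain Python. This is a minimal sketch, not a GPU kernel: `tiled_sum` and the input values are hypothetical, chosen so the rounding is easy to follow in float64, and the accumulation loop is written out by hand so no library summation routine interferes.

```python
def sequential_sum(values):
    """Add values left to right, rounding after every addition."""
    acc = 0.0
    for v in values:
        acc = acc + v
    return acc


def tiled_sum(values, tile):
    """Sum each tile on its own, then add the per-tile partial sums.

    A different tile size changes which numbers meet first, which is the
    same thing a batch-dependent reduction strategy does on a GPU.
    """
    partials = [
        sequential_sum(values[i:i + tile])
        for i in range(0, len(values), tile)
    ]
    return sequential_sum(partials)


# Non-associativity in its smallest form.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# Same four numbers, two tilings (hypothetical values).
values = [1e16, 1.0, -1e16, 1.0]
print(tiled_sum(values, 1))  # each element is its own tile
print(tiled_sum(values, 2))  # pairs are summed first
```

With a tile of 1, the first 1.0 is absorbed by 1e16 but the last one survives; with a tile of 2, both are absorbed before the large terms cancel. The inputs are identical; only the grouping changed.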

Thinking Machines Lab names this property batch invariance: a kernel is batch-invariant if its output for a given sequence does not depend on what else happens to be in the batch. Standard attention, RMSNorm, and matmul kernels in FlashInfer and FlashAttention are not batch-invariant. The 1,000-run experiment is the proof: without intervention, the 1,000 runs produced 80 distinct completions. The first 102 tokens were always identical, then 8 runs wrote “New York City” instead of “Queens, New York”: a coin flip driven by a rounding bit, not by the model’s intent. With batch-invariant kernels patched in, all 1,000 were byte-identical.
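The same measurement is easy to repeat against your own endpoint. The sketch below assumes you have already collected the token IDs of each completion into a list; `summarize` and `first_divergence` are hypothetical helper names, and the sample data is made up for illustration.

```python
from collections import Counter


def first_divergence(a, b):
    """Index of the first position where two token sequences differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one sequence is a prefix of the other
    return None


def summarize(completions):
    """Count distinct completions and find where they first split.

    The most common completion is used as the reference sequence.
    """
    counts = Counter(tuple(c) for c in completions)
    reference, _ = counts.most_common(1)[0]
    splits = [first_divergence(reference, c) for c in counts if c != reference]
    return {
        "total": len(completions),
        "distinct": len(counts),
        "earliest_divergence": min(splits) if splits else None,
    }


# Made-up token IDs standing in for repeated temperature-0 requests.
runs = [[1, 2, 3, 4]] * 5 + [[1, 2, 9, 4]] * 2 + [[1, 2, 3, 7]]
print(summarize(runs))
```

A batch-invariant stack should report one distinct completion and no divergence index for any prompt at temperature 0.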

How Much Does It Actually Matter

For a casual chatbot, a different phrasing of Feynman’s birthplace is cosmetic. For anything you score or measure, it is not. A study from Rice University and collaborators, published as arXiv:2506.09501, systematically varied hardware and batch conditions on reasoning models and quantified the damage:

| Condition changed (greedy, bfloat16) | Effect on DeepSeek-R1-Distill-Qwen-7B |
| --- | --- |
| GPU count (1 vs multiple) | Up to 9% accuracy variation |
| GPU type / driver version | Different token IDs from step 1 |
| Evaluation batch size | Up to 9,000 tokens difference in response length |
| Same batch, batch-invariant kernels | Bitwise-identical output |

Nine percent accuracy drift means your benchmark numbers are not reproducible across the hardware you happen to rent. Nine thousand tokens of length variance on a reasoning model directly distorts cost, latency, and the eval you shipped to leadership. If you re-ran your acceptance suite on a different instance class and got a different pass rate, this is why — and as Tian Pan notes in his analysis of the non-determinism tax in production, cloud-region differences in GPU generation and CUDA driver compound the effect further.

Reasoning Models Turn Bits Into Branches

The 9% swing is not linear noise that washes out over a dataset. Reasoning models autoregressively consume their own output, so a one-bit difference at an early token propagates into a different chain of thought. The Rice team emphasizes that for DeepSeek-R1-style distillates, “minor rounding differences in early tokens can cascade into divergent chains of thought.” The model is not confused; it is walking a genuinely different path because a softmax tie broke the other way on a different reduction order.
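A toy decoder makes the cascade concrete. Everything here is hypothetical: the three-token vocabulary, the `LOGITS` table, and the size of the nudge are invented for illustration and stand in for a real model whose logits shift by a rounding error.

```python
# Next-token logits keyed by the previous token (made-up numbers).
LOGITS = {
    0: [0.1, 2.0, 2.0000001],  # tokens 1 and 2 are nearly tied
    1: [0.0, 3.0, 1.0],
    2: [3.0, 0.0, 1.0],
}


def greedy_decode(start, steps, perturbation=None):
    """Greedy (argmax) decoding over the toy table.

    `perturbation` maps a step index to {token: delta}, modelling a tiny
    numerical difference in the logits at that step.
    """
    perturbation = perturbation or {}
    tokens, prev = [], start
    for step in range(steps):
        logits = list(LOGITS[prev])
        for token, delta in perturbation.get(step, {}).items():
            logits[token] += delta
        prev = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(prev)
    return tokens


baseline = greedy_decode(0, 5)
nudged = greedy_decode(0, 5, {0: {1: 1e-6}})  # one tiny shift at step 0
print(baseline)
print(nudged)
```

The nudge touches only the first step, yet every later token differs, because each step conditions on the token chosen before it.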

This makes nondeterminism a first-class production risk, not a benchmarking footnote. Any system that caches agent decisions, replays traces, asserts on exact output, or bills per token inherits the variance. And any A/B test comparing two model versions is contaminated by within-version noise that can exceed the between-version signal you are trying to measure.

Engineering Fixes Available Now

The good news: the fix is shipping. SGLang landed a deterministic mode built directly on Thinking Machines Lab’s batch-invariant operators, exposed as a single flag:

--enable-deterministic-inference

It requires an attention backend that implements the batch-invariant path — FlashInfer, FlashAttention 3, or Triton — and importantly still composes with the performance features you already run: chunked prefill, CUDA graphs, and radix (prefix) caching. SGLang also supports deterministic non-greedy sampling via explicit seeds (default 42), so you can reproduce a temperature-0.8 rollout bit-for-bit across machines, which matters for GRPO-style reinforcement learning.

For teams that cannot swap engines, the Rice team offers a lighter lever: LayerCast, a pipeline that stores weights in 16-bit (preserving the memory budget) but casts them to FP32 for computation, cutting the rounding error that seeds the cascade. vLLM’s own deterministic efforts are in progress upstream; the unoptimized Thinking Machines patch measured roughly 2.1× slower than default (55s vs 26s for 1,000 sequences with Qwen3-8B on a single GPU), tightening to about 1.6× with an improved attention kernel (42s). The deterministic-mode cost is real but bounded, and it is the price of a result you can defend.
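The intuition behind computing in a wider format can be shown without a GPU. The sketch below is not the Rice team's LayerCast code; it is a toy that emulates bfloat16 and FP32 rounding with the standard `struct` module (the `to_bf16`, `to_fp32`, and `accumulate` helpers are hypothetical) and sums the same bfloat16-representable values in two orders.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def to_bf16(x):
    """Round to bfloat16: keep the top 16 bits of float32, nearest-even."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def accumulate(values, rounder):
    """Sum left to right, rounding the accumulator after every addition."""
    acc = 0.0
    for v in values:
        acc = rounder(acc + v)
    return acc


values = [256.0, 1.0, 1.0, 1.0, 1.0]  # all exactly representable in bfloat16
reordered = list(reversed(values))

print(accumulate(values, to_bf16), accumulate(reordered, to_bf16))
print(accumulate(values, to_fp32), accumulate(reordered, to_fp32))
```

In the bfloat16 accumulator the two orders disagree, because adding 1.0 to 256.0 rounds back to 256.0; in FP32 both orders agree. Wider arithmetic does not make a kernel batch-invariant, but it leaves far fewer bits for reordering to disturb.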

The Training Blind Spot Nobody Mentions

If you do reinforcement learning from verifiable rewards, this same nondeterminism is silently corrupting your gradient signal. The sampler that generates rollouts runs on inference numerics; the trainer computes logprobs on training numerics. When those differ — and they do, by the mechanism above — your nominally on-policy training is actually off-policy, with an unmeasured KL gap between the behavior and target policies.

Thinking Machines Lab shows the consequence concretely: a GRPO run without importance-weighting saw its reward collapse mid-training when that hidden KL divergence spiked. The standard fix — importance sampling correction — papers over the symptom. Achieving bitwise equivalence between sampler and trainer eliminates the KL gap entirely (it sits at a flat zero) and removes the need for the correction term. For anyone shipping RLHF/RLVR pipelines, deterministic inference is not a nice-to-have; it is a precondition for trusting your training curve.
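A quick way to see whether your own pipeline has this gap is to compare the per-token logprobs the sampler reported with the ones the trainer recomputes for the same tokens. The sketch below assumes both are available as plain lists; the function names and the sample numbers are hypothetical, and the KL value is a simple Monte Carlo estimate over the sampled tokens rather than an exact divergence.

```python
import math


def importance_weights(sampler_logprobs, trainer_logprobs):
    """Per-token ratio pi_trainer / pi_sampler for the sampled tokens."""
    return [math.exp(t - s) for s, t in zip(sampler_logprobs, trainer_logprobs)]


def kl_estimate(sampler_logprobs, trainer_logprobs):
    """Monte Carlo estimate of KL(sampler || trainer).

    Averages log pi_sampler - log pi_trainer over tokens that were drawn
    from the sampler.
    """
    diffs = [s - t for s, t in zip(sampler_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)


# Made-up logprobs for three sampled tokens.
sampler = [-0.5, -1.2, -0.3]
trainer_mismatched = [-0.5, -1.25, -0.3]  # numerics differ on one token
trainer_identical = list(sampler)         # bitwise-equal numerics

print(kl_estimate(sampler, trainer_mismatched))
print(importance_weights(sampler, trainer_identical))
```

When sampler and trainer agree bit for bit, every weight is exactly 1.0 and the estimate is exactly zero, which is why the correction term becomes unnecessary.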

What to Verify Before Deploying

  • Pin your batch size for evals. If you benchmark with a fixed batch of 1 on a single GPU, you remove one variable but not the others (driver, region, kernel version). Document the full matrix.
  • Enable SGLang’s deterministic flag (or the equivalent vLLM path) on any endpoint that feeds scoring, replay, billing, or caching logic.
  • Audit reasoning-model cost claims. A 9,000-token length variance means your p99 latency and token-cost estimates have a hidden error bar. Re-measure across at least two GPU types.
  • Separate within-version from between-version noise in A/B tests. If your eval variance exceeds your treatment effect, the test is uninformative.
  • Treat RL sampler/trainer numeric mismatch as a bug. If you run on-policy RL without importance weighting and without deterministic inference, your reward collapse is a feature of your stack, not your data.
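The A/B item in the checklist can be turned into a rough gate. This is a crude heuristic under stated assumptions, not a statistical test: `ab_report` is a hypothetical helper, the pass rates are made up, and the factor of 2 is an arbitrary threshold you would want to replace with a proper significance test for real decisions.

```python
import statistics


def ab_report(runs_a, runs_b, factor=2.0):
    """Compare the between-version effect with within-version noise.

    runs_a / runs_b: pass rates from repeated runs of the same eval on
    each model version. `factor` is an assumed, arbitrary threshold.
    """
    effect = abs(statistics.mean(runs_b) - statistics.mean(runs_a))
    noise = max(statistics.stdev(runs_a), statistics.stdev(runs_b))
    return {
        "effect": effect,
        "noise": noise,
        "informative": effect > factor * noise,
    }


# Made-up pass rates: nondeterministic serving vs deterministic serving.
noisy = ab_report([0.70, 0.74, 0.66, 0.72], [0.73, 0.69, 0.77, 0.71])
stable = ab_report([0.70, 0.70, 0.70, 0.70], [0.73, 0.73, 0.73, 0.73])
print(noisy["informative"], stable["informative"])
```

In the first case the run-to-run spread is larger than the gap between versions, so the comparison says nothing; with deterministic serving the spread vanishes and even a small gap is visible.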

Nondeterminism in LLM inference is not mysterious and it is not unavoidable. It is a missing property of the kernels you run, it has a measured cost, and the engineering to close it is already in the inference engines you depend on. The remaining question is whether your team treats a result it cannot reproduce as acceptable — because your auditors, your customers, and your training stability will eventually answer it for you.


References