CUDA Graphs + torch.compile: 1.65x LLM Decode Speedup

A single decode step for Llama 3.1 8B on an H100 SXM5 takes 8.4 milliseconds in eager mode. Capture that same forward pass as a CUDA graph and it drops to 5.1 milliseconds — a 1.65× speedup. The reason is not faster math; it is the elimination of CPU-side kernel-launch overhead that silently consumes 20–30 percent of step time at small batch sizes (Spheron, 2026). Most inference teams never profile this layer.

The CPU Bottleneck Nobody Profiles

When PyTorch launched in 2016, its core assumption was CPU/GPU overlap: the Python program on the CPU dispatches GPU kernels faster than the GPU executes them, so the CPU never becomes the bottleneck. That assumption has aged badly. Half-precision floating-point throughput on NVIDIA GPUs has increased roughly 47× since the GP100, while memory bandwidth grew 4.6× over the same window (Fireworks AI). The GPU got dramatically faster. The host CPU did not.

Every PyTorch operation passes through four CPU-side layers before a single thread runs on the GPU. First, Python metaprogramming logic — loops over layers, conditional branches — executes on the interpreter. Second, the PyTorch dispatcher inspects tensor properties (dtype, device, whether autograd is active) to select which compute kernel to call. Third, the caching memory allocator may service or defer a CUDA malloc. Fourth, the CUDA driver itself validates arguments and enqueues the kernel (Fireworks AI).

A single forward pass through a 7B transformer issues thousands of these small kernel launches (Spheron, 2026). At batch size 1, each decode step is almost entirely memory-bandwidth-bound: the GPU streams the full weight set from HBM to perform matrix multiplies over a single token, so many individual kernels finish in microseconds. For those kernels the CPU spends longer dispatching than the GPU spends executing, and the GPU idles between launches. This is the regime where graph capture delivers its largest gains.
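A back-of-the-envelope check makes the memory-bound claim concrete. The sketch below estimates the time needed just to stream the weights once per decode step for the Llama 3.1 8B on H100 SXM5 setup from the introduction. The parameter count (about 8.03 billion) and the peak HBM3 bandwidth (about 3.35 TB/s) are assumptions taken from public spec sheets, not from the cited sources, and the estimate ignores KV-cache reads and activations.

```python
# Rough lower bound on batch-size-1 decode time: every weight is read once.
# Assumed figures (public spec sheets, not from the article's sources):
PARAMS = 8.03e9            # Llama 3.1 8B parameter count
BYTES_PER_PARAM = 2        # bf16
HBM_BANDWIDTH = 3.35e12    # H100 SXM5 peak HBM3 bandwidth, bytes/second


def weight_streaming_floor_ms(params, bytes_per_param, bandwidth):
    """Time in ms to read all weights once at peak memory bandwidth."""
    return params * bytes_per_param / bandwidth * 1e3


floor_ms = weight_streaming_floor_ms(PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH)
print(f"weight-streaming floor: {floor_ms:.2f} ms per step")
```

Under these assumptions the floor is about 4.8 ms, so the 5.1 ms graph-replay figure is already close to what memory bandwidth allows, and most of the gap to the 8.4 ms eager figure is time not spent moving weights.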

How CUDA Graphs Actually Work

A CUDA graph records every kernel launch, memory allocation, and synchronization event in a forward pass during a warm-up capture phase. On subsequent steps, the entire sequence replays as a single cudaGraphLaunch call from the CPU — one submission instead of thousands (Thomas, 2025). The GPU receives a pre-built command buffer and executes it without waiting for the host.

The capture is rigid: it records exact tensor addresses and shapes. If either changes between steps, the graph is invalid and must be re-captured from scratch. This static-shape requirement is the central engineering constraint of CUDA graphs in production, and it is why inference frameworks invest heavily in shape management rather than in the capture mechanism itself.
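The capture and replay cycle can be sketched with PyTorch's `torch.cuda.CUDAGraph` API. This is a minimal illustration, not a production recipe: `model` and `static_input` are hypothetical names for any CUDA-resident callable and its pre-allocated input tensor, and the three warm-up iterations are an arbitrary choice. Because the graph is bound to fixed addresses, new data is copied into the captured input buffer rather than passed as a fresh tensor.

```python
def capture_decode_graph(model, static_input):
    """Capture one forward pass of `model`; returns (graph, static_output)."""
    import torch  # lazy import: needs a CUDA build of PyTorch at call time

    # Warm up on a side stream so one-time initialisation (library handles,
    # allocator growth) happens before capture rather than during it.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_output = model(static_input)   # recorded, bound to these buffers
    return graph, static_output


def replay(graph, static_input, static_output, new_input):
    """Run the captured graph on new data of the same shape."""
    static_input.copy_(new_input)   # the graph reads from the captured address
    graph.replay()                  # one launch replays every recorded kernel
    return static_output.clone()    # copy out before the next replay overwrites it
```

On a CUDA machine the intended use is `graph, out = capture_decode_graph(model, x)` once at startup, then `replay(graph, x, out, next_x)` on every step.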

vLLM addresses this with a CUDA Graphs dispatcher — a central controller that selects the appropriate pre-captured graph per batch size automatically, and makes graph support orthogonal to the compilation pipeline (vLLM Documentation). The server pre-captures graphs during a warm-up pass at startup, covering a discrete set of batch-size buckets rather than every possible configuration.
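The bucket-selection idea can be illustrated with a toy dispatcher. This is a hypothetical sketch of the general pattern, not vLLM's implementation: the class name, the bucket set `[1, 2, 4, 8]`, and the `fake_capture` and `fake_eager` stand-ins are all invented here so the example runs without a GPU.

```python
import bisect


class BucketedGraphDispatcher:
    """Toy dispatcher: one pre-captured runner per batch-size bucket."""

    def __init__(self, buckets, capture_fn, eager_fn):
        self.buckets = sorted(buckets)
        self.eager_fn = eager_fn
        # Warm-up pass: capture once per bucket, before serving traffic.
        self.runners = {b: capture_fn(b) for b in self.buckets}

    def bucket_for(self, batch_size):
        """Smallest bucket that fits batch_size, or None if none does."""
        i = bisect.bisect_left(self.buckets, batch_size)
        return self.buckets[i] if i < len(self.buckets) else None

    def run(self, batch):
        bucket = self.bucket_for(len(batch))
        if bucket is None:
            return self.eager_fn(batch)              # too large: fall back to eager
        padded = list(batch) + [0] * (bucket - len(batch))
        return self.runners[bucket](padded)[: len(batch)]   # drop padding rows


# Stand-ins for the real capture and eager paths.
def fake_capture(bucket):
    return lambda rows: [("graph", bucket, r) for r in rows]


def fake_eager(rows):
    return [("eager", len(rows), r) for r in rows]


dispatcher = BucketedGraphDispatcher([1, 2, 4, 8], fake_capture, fake_eager)
print(dispatcher.run([10, 11, 12]))   # a batch of 3 is padded to the bucket of 4
```

Batches larger than the biggest bucket fall back to the eager path here; a real server might instead split the batch or capture a larger bucket.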

torch.compile’s Three Modes

PyTorch’s torch.compile contributes two things that together enable graph capture. First, the Inductor backend performs kernel fusion: it identifies adjacent elementwise operations, matrix multiplies, and normalization layers that can execute as a single fused kernel instead of a chain of separate launches. Second, the compiled region becomes stable enough (no Python-side control-flow variation, no dynamic allocation) that a CUDA graph can capture it (Spheron, 2026).

Three compilation modes matter in practice:

| Mode | What it does | When to use |
| --- | --- | --- |
| default | Full Inductor optimization, no CUDA graph capture | First pass when debugging graph breaks |
| reduce-overhead | Inductor optimization + CUDA graph capture | Production inference at fixed batch sizes |
| max-autotune | Exhaustive kernel search + CUDA graph | Latency-critical, offline benchmark setup |

For most LLM serving workloads, reduce-overhead is the correct default. The first call takes 30–90 seconds for kernel compilation; subsequent calls with the same input shapes reuse the compiled kernels and captured graphs, so that cost is paid once (Spheron, 2026). SGLang exposes this through --enable-torch-compile with a --torch-compile-max-bs ceiling, and toggles graph capture separately with --cuda-graph-max-bs and --disable-cuda-graph (Verda/DataCrunch, 2025).
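Selecting a mode is a one-line change in code. The sketch below wraps `torch.compile` with the three mode strings discussed here and adds a warm-up helper so the slow first call happens before traffic arrives. The helper names `compile_for_decode` and `warm_up`, and the default of three warm-up iterations, are hypothetical choices for illustration.

```python
ARTICLE_MODES = ("default", "reduce-overhead", "max-autotune")


def compile_for_decode(model, mode="reduce-overhead"):
    """Wrap `model` with torch.compile using one of the three modes above."""
    if mode not in ARTICLE_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {ARTICLE_MODES}")
    import torch  # imported lazily so the check above works without PyTorch
    return torch.compile(model, mode=mode)


def warm_up(compiled_model, example_inputs, iterations=3):
    """Call the model a few times so compilation and capture happen up front."""
    import torch
    with torch.no_grad():
        for _ in range(iterations):
            compiled_model(*example_inputs)
```

A typical startup sequence would be `decoder = compile_for_decode(model)` followed by `warm_up(decoder, (example_input,))` for each input shape the server expects to see.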

The Static Shape Problem

Fixed tensor shapes are ideal for CUDA graphs. Production traffic is not fixed. Two strategies dominate:

Bucketed padding. Round every input up to the next bucket boundary (512, 1024, 2048, or 4096 tokens) and pre-capture a separate graph per bucket. The cost is padding waste; the win is zero recompilation after warm-up and full CUDA graph compatibility. For latency-critical serving where you control request grouping, this is the standard choice (Spheron, 2026).
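A minimal sketch of the rounding step, using the four boundaries listed in the text. The function names, the `pad_id=0` default, and the attention-mask convention (1 for real tokens, 0 for padding) are assumptions made for this example.

```python
import bisect

BUCKETS = (512, 1024, 2048, 4096)   # boundaries from the text


def pick_bucket(seq_len, buckets=BUCKETS):
    """Smallest bucket boundary that is >= seq_len."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad token_ids to its bucket; returns (padded_ids, attention_mask)."""
    bucket = pick_bucket(len(token_ids), buckets)
    n_pad = bucket - len(token_ids)
    return list(token_ids) + [pad_id] * n_pad, [1] * len(token_ids) + [0] * n_pad


def padding_waste(seq_len, buckets=BUCKETS):
    """Fraction of the padded sequence that is padding."""
    return 1 - seq_len / pick_bucket(seq_len, buckets)


for n in (300, 512, 513, 3000):
    print(n, pick_bucket(n), f"{padding_waste(n):.0%}")
```

The worst case sits just past a boundary: a 513-token input lands in the 1024 bucket, so roughly half of the padded sequence is waste. Finer-grained buckets reduce that at the cost of more graphs to capture at startup.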

Dynamic shapes. Pass dynamic=True to torch.compile. Dynamo generates shape guards — assert 0 <= seq_len <= 4096 — instead of exact-match checks. Within the valid range, the same compiled graph runs without recompilation. The trade-off: no CUDA graph capture, so you keep per-step dispatch overhead, and each guard check adds a small constant cost (Spheron, 2026).
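Why range guards cut recompilation can be shown with a toy model. The code below is not Dynamo's implementation: it only counts how many compiled variants a stream of sequence lengths would trigger under exact-match guards versus a single range guard like the one quoted above. The function names and the sample traffic are invented for illustration.

```python
def compilations_needed(seq_lens, make_guard):
    """Count compiled variants triggered by a stream of sequence lengths.

    A new variant is compiled whenever no existing guard accepts the input.
    """
    guards = []
    for n in seq_lens:
        if not any(guard(n) for guard in guards):
            guards.append(make_guard(n))
    return len(guards)


def exact_match_guard(n):
    return lambda m: m == n


def range_guard(n, lo=0, hi=4096):
    # Mirrors the guard quoted in the text: assert 0 <= seq_len <= 4096
    if lo <= n <= hi:
        return lambda m: lo <= m <= hi
    return exact_match_guard(n)   # outside the range: specialise on this length


traffic = [17, 230, 17, 1024, 4000, 230]
print("exact-match guards:", compilations_needed(traffic, exact_match_guard))
print("range guard:       ", compilations_needed(traffic, range_guard))
```

With exact-match guards every distinct length compiles its own variant; with the range guard one variant serves everything inside the interval, and only an out-of-range length triggers a new compile.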

| Approach | Recompilation | Runtime overhead | CUDA graph compatible |
| --- | --- | --- | --- |
| Fixed bucket | None after warm-up | Padding waste | Yes (one graph per bucket) |
| Dynamic shapes | On range change | Small guard check | No (graph capture is skipped) |
| Fully dynamic | Every new length | Full recompile cost | No |

What PyTorch 2.6 Changed

PyTorch 2.6 landed three changes that matter specifically for LLM inference compilation. Regional compilation lets you call torch.compile on individual submodules — the attention and MLP blocks — while leaving custom preprocessing or embedding logic in eager mode. This matters when parts of the model use hand-written CUDA ops that Dynamo cannot trace (Spheron, 2026).
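A sketch of what per-block compilation might look like. It assumes the model keeps its transformer blocks in an indexable container under an attribute named `layers`; that attribute name, the helper `compile_blocks`, and the injectable `compile_fn` parameter are hypothetical. Passing a stand-in `compile_fn` lets the wiring be checked without PyTorch installed.

```python
from types import SimpleNamespace


def compile_blocks(model, block_attr="layers", compile_fn=None):
    """Compile each block in model.<block_attr> in place; leave the rest eager."""
    if compile_fn is None:
        import torch  # lazy import: only needed for the real compile path
        compile_fn = torch.compile
    blocks = getattr(model, block_attr)
    for i in range(len(blocks)):
        blocks[i] = compile_fn(blocks[i])
    return model


# Dry run with stand-ins (no PyTorch needed): "compiling" just tags the block.
toy = SimpleNamespace(embed="embed", layers=["block0", "block1"])
compile_blocks(toy, compile_fn=lambda block: f"compiled({block})")
print(toy.embed, toy.layers)
```

Everything outside the chosen container, such as the embedding or custom preprocessing, is left untouched and keeps running in eager mode.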

Symbolic shapes via Dim.AUTO improve torch.export support for dynamic dimensions. Marked dimensions generate guards over a shape range rather than exact values, reducing recompilation storms across sequence lengths within a defined interval.

Custom op registration through torch.library.custom_op is the recommended path for integrating kernels Dynamo cannot trace: FlashAttention, custom RoPE, hand-tuned GQA variants. Registration creates an opaque boundary: Dynamo does not trace into the kernel and instead records the call as a single node in the captured graph, so no graph break is needed. Inductor then compiles the surrounding linear projections and normalization layers into fused kernels while the custom op runs through its own CUDA or Triton pipeline (Spheron, 2026). The 2.6 release also significantly reduced graph-break rates for standard transformer patterns (nn.MultiheadAttention, RoPE, and grouped-query attention) compared to 2.4 and 2.5.

Benchmarks: Where the Speedup Lands

The gain is not uniform across batch sizes. On Llama 3.1 8B, single H100 SXM5, decode phase only:

| Batch size | Eager decode (ms/step) | CUDA graph (ms/step) | Speedup |
| --- | --- | --- | --- |
| 1 | 8.4 | 5.1 | 1.65× |
| 4 | 9.2 | 6.0 | 1.53× |
| 16 | 12.1 | 9.3 | 1.30× |
| 32 | 18.7 | 16.1 | 1.16× |

Source: Spheron, April 2026, Llama 3.1 8B bf16 on H100 SXM5
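The speedup column can be recomputed from the two timing columns, and the same arithmetic shows how many milliseconds the graph removes from each step. The snippet uses only the figures in the table; the function name `summarize` is arbitrary.

```python
# (batch size, eager ms/step, CUDA graph ms/step) from the benchmark table
ROWS = [(1, 8.4, 5.1), (4, 9.2, 6.0), (16, 12.1, 9.3), (32, 18.7, 16.1)]


def summarize(rows):
    """Return speedup and absolute time saved per step for each row."""
    out = []
    for batch, eager_ms, graph_ms in rows:
        out.append({
            "batch": batch,
            "speedup": round(eager_ms / graph_ms, 2),
            "saved_ms": round(eager_ms - graph_ms, 1),
        })
    return out


for row in summarize(ROWS):
    print(row)
```

The absolute saving stays between 2.6 and 3.3 ms per step at every batch size, which is what a roughly fixed per-step dispatch cost would look like. Only its share of the total shrinks as the step gets longer.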

The pattern is clear: as batch size grows, compute time begins to dominate dispatch overhead, and the graph speedup shrinks. CUDA graphs deliver the largest wins exactly where they matter most for user-facing latency — low-concurrency, single-stream serving. Fireworks AI measured a 2.3× end-to-end speedup on LLaMA v2-7B inference with CUDA graphs, reaching 69 tokens/s at batch size 1 on an A100 (Fireworks AI).

FlashAttention Integration Done Right

FlashAttention kernels are written in CUDA and Triton, and Dynamo cannot trace into them. The correct integration is to register the kernel as a custom op with torch.library.custom_op. One production gotcha: use an application-specific namespace such as myapp:: rather than flash_attn::. FlashAttention 2.x+ ships its own op registration under the flash_attn:: namespace, and registering a second op with the same name raises RuntimeError: Trying to define a custom op with the same name as an existing op (Spheron, 2026). With proper registration, Dynamo compiles the surrounding projections and normalization while calling FlashAttention as an opaque function: you get fused kernels everywhere Dynamo can trace, and hand-tuned CUDA everywhere it cannot.
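A hedged sketch of the registration pattern follows. The op body uses PyTorch's built-in `scaled_dot_product_attention` as a stand-in, since the exact FlashAttention call depends on the installed package; the op name `myapp::attention`, the helper `op_name`, and the set of reserved namespaces are illustrative assumptions. The fake implementation assumes the output has the same shape and dtype as `q`.

```python
RESERVED_NAMESPACES = {"aten", "flash_attn"}   # owned by PyTorch / FlashAttention


def op_name(namespace, name):
    """Build a 'namespace::name' string, refusing namespaces we do not own."""
    if namespace in RESERVED_NAMESPACES:
        raise ValueError(f"namespace {namespace!r} is reserved; use an app-specific one")
    return f"{namespace}::{name}"


def register_attention_op(namespace="myapp"):
    """Register an opaque attention op; call once per process."""
    import torch  # lazy import: needs a PyTorch that provides torch.library.custom_op

    @torch.library.custom_op(op_name(namespace, "attention"), mutates_args=())
    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Stand-in body: a real integration would call the FlashAttention kernel here.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

    @attention.register_fake
    def _(q, k, v):
        # Shape/dtype-only version used during tracing; assumes output matches q.
        return torch.empty_like(q)

    return attention
```

Calling `register_attention_op()` a second time in the same process would attempt to define the same name again, which is the duplicate-registration failure described above, so registration belongs in module initialisation.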

When Graphs Are the Wrong Tool

CUDA graphs have a cost profile that does not fit every workload. If your batch size varies unpredictably across a wide range and you cannot bucket, the warm-up capture cost for many shapes can exceed the latency savings. Input shapes that change on every request force a fresh capture (and, under torch.compile, a recompilation) on each call, making graphs slower than eager mode. Training workloads with backward passes and gradient accumulation are poor candidates: graph capture pays off for the repeated, identical decode forward pass, not for training steps whose work is less static from one iteration to the next. And if your serving backend already uses vLLM or SGLang, both frameworks ship graph capture enabled by default; adding a manual torch.compile wrapper on top can conflict with the framework’s own dispatcher and produce silent correctness bugs (vLLM Documentation).

The practical test: profile with torch.profiler or Nsight Systems. If your CPU-side kernel-launch overhead is under 5 percent of step time — typically because you run at high batch size or your model is compute-bound — graphs will not help. If it is 15–30 percent, which is the common case for batch-size-1 to batch-size-8 serving, graph capture is the single highest-ROI optimization you have not applied yet.
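The decision rule can be written down directly. The sketch below takes two numbers you would read off a profiler trace, the wall-clock step time and the time the GPU spent executing kernels, and applies the thresholds from the paragraph above. Treating GPU idle time as a proxy for launch overhead is a simplifying assumption, and the "borderline" band between 5 and 15 percent is added here because the text does not cover it.

```python
def launch_overhead_fraction(step_ms, gpu_busy_ms):
    """Share of the step during which the GPU was not executing kernels."""
    if step_ms <= 0 or not 0 <= gpu_busy_ms <= step_ms:
        raise ValueError("expected step_ms > 0 and 0 <= gpu_busy_ms <= step_ms")
    return 1 - gpu_busy_ms / step_ms


def recommend(overhead):
    """Apply the thresholds from the text to an overhead fraction."""
    if overhead < 0.05:
        return "skip: graphs will not help"
    if overhead >= 0.15:
        return "capture: graphs are likely the best next optimization"
    return "borderline: benchmark both paths"


# Hypothetical (step_ms, gpu_busy_ms) pairs, not measurements.
for step_ms, busy_ms in [(20.0, 19.4), (10.0, 9.0), (8.0, 6.0)]:
    frac = launch_overhead_fraction(step_ms, busy_ms)
    print(f"{frac:.0%} -> {recommend(frac)}")
```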

References