CUDA Graphs + torch.compile: 1.65x LLM Decode Speedup

A single decode step for Llama 3.1 8B on an H100 SXM5 takes 8.4 milliseconds in eager mode. Capture that same forward pass as a CUDA graph and it drops to 5.1 milliseconds — a 1.65× speedup. The reason is not faster math; it is the elimination of CPU-side kernel-launch …
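The headline ratio can be checked directly from the two latencies quoted above. The following is a minimal arithmetic sketch, not a benchmark: the constant and function names are illustrative, and the steps-per-second figures assume back-to-back decode steps with nothing else on the critical path, which the text does not state.

```python
# Latencies quoted in the text (milliseconds per decode step).
EAGER_MS = 8.4   # eager mode
GRAPH_MS = 5.1   # same forward pass replayed as a CUDA graph


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the old step latency to the new one."""
    return before_ms / after_ms


def steps_per_second(step_ms: float) -> float:
    """Decode steps per second, assuming steps run back to back (an assumption)."""
    return 1000.0 / step_ms


ratio = speedup(EAGER_MS, GRAPH_MS)
saved_ms = EAGER_MS - GRAPH_MS          # time removed from each step
saved_fraction = saved_ms / EAGER_MS    # share of the eager step that disappears

print(f"speedup:        {ratio:.2f}x")
print(f"saved per step: {saved_ms:.1f} ms ({saved_fraction:.1%} of eager)")
print(f"steps/s eager:  {steps_per_second(EAGER_MS):.1f}")
print(f"steps/s graph:  {steps_per_second(GRAPH_MS):.1f}")
```

Under these numbers, roughly 3.3 ms of every 8.4 ms eager step, a bit under 40%, is time that graph capture removes.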