Prefill-Decode Disaggregation: NVIDIA’s 7x Inference Fix

Every LLM inference request is two workloads pretending to be one. Prefill processes your entire prompt in a single compute-bound forward pass — it wants raw FP8 TFLOPS. Decode generates tokens one at a time by streaming KV cache tensors through memory — it wants HBM bandwidth. Running both on …
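To make the compute-bound versus bandwidth-bound split concrete, here is a rough roofline-style sketch. It rests on simplifying assumptions that are not from the article: a dense model costs about 2 FLOPs per parameter per token, weights are stored at 1 byte per parameter (FP8), and attention FLOPs and activation traffic are ignored. The function names, the 70B parameter count, the 1 GB KV cache, and the accelerator figures (1,000 TFLOPS, 3 TB/s) are illustrative round numbers, not the specs of any particular GPU.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from HBM

    @property
    def intensity(self) -> float:
        """Arithmetic intensity in FLOPs per byte."""
        return self.flops / self.bytes_moved


def prefill(params: float, prompt_tokens: int, bytes_per_param: float = 1.0) -> Phase:
    # Assumption: weights are read once and reused across every prompt token.
    return Phase("prefill", 2.0 * params * prompt_tokens, params * bytes_per_param)


def decode_step(params: float, kv_cache_bytes: float, bytes_per_param: float = 1.0) -> Phase:
    # Assumption: one token per step at batch size 1, so the weights and the
    # KV cache are re-read from memory for just 2 FLOPs per parameter.
    return Phase("decode", 2.0 * params, params * bytes_per_param + kv_cache_bytes)


def bound(phase: Phase, peak_flops: float, mem_bandwidth: float) -> str:
    # The ridge point is the intensity at which compute and bandwidth balance.
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if phase.intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    PEAK_FLOPS = 1e15  # hypothetical accelerator: 1,000 TFLOPS
    BANDWIDTH = 3e12   # hypothetical accelerator: 3 TB/s
    for phase in (prefill(70e9, 4096), decode_step(70e9, 1e9)):
        print(phase.name, round(phase.intensity, 2), bound(phase, PEAK_FLOPS, BANDWIDTH))
```

Under these assumptions, a 4,096-token prefill performs thousands of FLOPs per byte read, far above the ridge point of roughly 333 FLOPs per byte, while a single decode step performs about 2. Batching decode requests raises that figure, but the gap between the two phases is the reason they favor different hardware.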