Agent Cache Rebuilds Waste 38% of GPU Time
When researchers at the University of Hong Kong instrumented a 32-GPU A100 cluster running SWE-bench coding agents on vLLM v0.6.0, they found a number that should bother every platform engineer: 38% of total execution time was spent regenerating KV cache that had been discarded between agent steps. Not computing new tokens. Not waiting on tools. Rebuilding state the system already computed and then threw away. End-to-end latency sat at 6x the theoretical minimum, that is, six times the sum of the individual inference times.
This isn’t a vLLM bug. It’s a structural mismatch. Current GPU schedulers treat every LLM call as an independent request. AI agents make 10 to 100 chained calls per task, each depending on the previous step’s output and interleaved with tool invocations that pause execution for 50ms to 30+ seconds. During those pauses, the KV cache — which for a 32K-context session with a 70B-parameter model consumes 2–12 GB of GPU memory per request — gets evicted by standard LRU policies that have no idea the agent will need it again in two seconds.
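To see where a figure in that 2–12 GB range can come from, here is a back-of-the-envelope calculation. The model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is an assumption based on common Llama-style 70B configurations, not something taken from the paper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one sequence occupies across all layers."""
    # Factor of 2: each token stores one key and one value vector per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token

# Assumed Llama-style 70B shape with grouped-query attention (illustrative).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 32 * 1024  # 32K tokens

fp16 = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
fp8 = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

print(f"FP16: {fp16 / 2**30:.1f} GiB")  # FP16: 10.0 GiB
print(f"FP8:  {fp8 / 2**30:.1f} GiB")   # FP8:  5.0 GiB
```

Under these assumptions a single full-context session holds about 10 GiB in FP16, so a handful of paused agents is enough to put real pressure on an 80 GB card.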
The paper, SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters, accepted at HPDC ’26, proposes treating the entire agent workflow as the schedulable unit rather than individual inference calls. On a 64-GPU cluster, SAGA reduces task completion time by 1.64x (geometric mean, p < 0.001) over vLLM v0.15.1 with prefix caching and affinity routing. The trade-off: approximately 30% lower peak throughput, which the authors argue is the right call for latency-sensitive interactive deployments.
Why Request-Level Scheduling Fails Agents
LLM serving frameworks — vLLM, SGLang, TensorRT-LLM — were built for single-shot inference. They optimize time-to-first-token and throughput for independent requests using continuous batching and PagedAttention. These are the right abstractions for chatbots and completion APIs. They break down for compound AI workloads.
Agent workloads exhibit three characteristics that violate request-level assumptions:
- Sequential dependency with variable gaps. Each reasoning step depends on the previous step’s output and potentially on external tool results. Tool invocations introduce idle periods ranging from 50ms (local code execution) to 30+ seconds (web API calls), during which the agent’s intermediate state must be preserved or regenerated.
- KV cache continuity across steps. Discarding cache between steps forces complete regeneration, adding 2–8x latency overhead per step. Production traces show 100:1 input-to-output token ratios and high prefix overlap within sessions.
- Bursty, correlated request patterns. Agent tasks generate bursts of related requests that share common prefixes — system prompts, tool definitions — and benefit from co-location on the same GPU.
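A rough way to see why cache continuity matters is to count how many tokens the GPU has to prefill over a session. The sketch below uses made-up numbers (a 4,000-token starting prompt, 500 tokens appended per step, 20 steps) and simplifies by treating every appended token as needing prefill. It counts tokens, not latency, so the ratio it prints is not comparable to the paper's per-step overhead figures.

```python
def prefill_tokens(base_prompt, added_per_step, steps, cache_retained):
    """Total tokens prefilled over one agent session."""
    total = 0
    context = base_prompt  # tokens in the prompt at the current step
    cached = 0             # tokens whose KV entries survived from the last step
    for _ in range(steps):
        total += context - cached            # only uncached tokens are recomputed
        cached = context if cache_retained else 0
        context += added_per_step            # model output + tool observation
    return total

# Illustrative session shape, not measured data.
cold = prefill_tokens(4000, 500, 20, cache_retained=False)
warm = prefill_tokens(4000, 500, 20, cache_retained=True)

print(cold, warm)  # 175000 13500
print(f"{cold / warm:.1f}x more prefill work without cache retention")
```

Without retention, prefill work grows quadratically with the number of steps because every step replays the whole history. With retention it grows linearly.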
The instrumented cluster showed GPU memory utilization averaging only 42% due to fragmented cache allocation. You’re paying for A100s and using less than half their memory effectively. This compounds the cost problem we covered in agentic AI workflows costing 5x more than budgeted.
Agent Execution Graphs Predict Cache Reuse
SAGA’s first mechanism is the Agent Execution Graph (AEG). Instead of treating each LLM call as opaque, SAGA captures the workflow structure — the pattern of Thought-Action-Observation loops — to predict which KV cache blocks will be needed next.
The key insight: agent workflows follow recurring patterns. A coding agent reads a file, edits it, runs tests, reads the error, and repeats. SAGA uses pattern-based inference to predict these sequences and retain the right cache blocks across tool-call boundaries. The result: its eviction policy comes within 1.31x of Bélády’s optimal offline policy, the best eviction decisions possible with perfect knowledge of future accesses.
This is a meaningful improvement over LRU eviction. Standard LRU has no concept of “this cache block belongs to a session that will resume in 500ms.” SAGA does. It applies tool-call-aware TTLs: if the agent is waiting on a fast local tool execution, the cache stays hot. If it’s waiting on a 30-second web scrape, SAGA can make an informed decision about whether to keep or evict, because the workflow structure indicates when the session is likely to resume.
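The idea of tool-call-aware eviction can be sketched in a few lines. This is not SAGA's implementation. It is a minimal Bélády-style heuristic that evicts the session predicted to resume furthest in the future, and the tool names, pause estimates, and `CacheEntry` type are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    session_id: str
    size_gb: float
    expected_resume_at: float  # predicted time of the session's next LLM call

# Hypothetical per-tool pause estimates in seconds (assumed, not from the paper).
EXPECTED_PAUSE_S = {"local_exec": 0.05, "test_run": 5.0, "web_api": 30.0}

def on_tool_call(session_id, size_gb, tool, now):
    """Record when a paused session is expected to need its cache again."""
    return CacheEntry(session_id, size_gb, now + EXPECTED_PAUSE_S[tool])

def choose_victims(entries, gb_needed):
    """Pick sessions to evict, furthest predicted resume first."""
    victims, freed = [], 0.0
    by_resume = sorted(entries, key=lambda e: e.expected_resume_at, reverse=True)
    for entry in by_resume:
        if freed >= gb_needed:
            break
        victims.append(entry.session_id)
        freed += entry.size_gb
    return victims

entries = [
    on_tool_call("a", 4.0, "local_exec", now=0.0),
    on_tool_call("b", 4.0, "web_api", now=0.0),
    on_tool_call("c", 4.0, "test_run", now=0.0),
]
print(choose_victims(entries, gb_needed=6.0))  # ['b', 'c']
```

LRU would treat all three sessions as equally stale, since each went idle at the same instant. With even a crude resume estimate, the session waiting on the fast local tool is the last one to lose its cache.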
Session-Affinity Batching With Work Stealing
The second mechanism addresses a subtler problem. Agent sessions generate correlated request bursts. If session A’s requests land on GPU 0 and GPU 3 across different calls, you lose prefix-sharing opportunities and fragment cache across devices. Session-affinity routing co-locates correlated requests on the same GPU.
The problem with naive affinity is load imbalance. If one agent is running a 50-step debugging session and another completes in 3 steps, you end up with hot GPUs and idle ones. SAGA adds work stealing — idle GPUs can steal requests from overloaded ones — to maintain global load balance while preserving locality where it matters.
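Here is a minimal sketch of affinity routing with work stealing, assuming one FIFO queue per GPU. The `AffinityRouter` class and its `steal_threshold` parameter are illustrative names, not part of SAGA or any serving framework.

```python
from collections import deque

class AffinityRouter:
    def __init__(self, num_gpus, steal_threshold=2):
        self.queues = [deque() for _ in range(num_gpus)]
        self.home = {}  # session_id -> index of the GPU holding its cache
        self.steal_threshold = steal_threshold

    def submit(self, session_id, request):
        """Queue a request on the session's home GPU and return that GPU index."""
        if session_id not in self.home:
            # New session: start it on the GPU with the shortest queue.
            self.home[session_id] = min(
                range(len(self.queues)), key=lambda g: len(self.queues[g])
            )
        gpu = self.home[session_id]
        self.queues[gpu].append((session_id, request))
        return gpu

    def next_request(self, gpu):
        """Return the next (session_id, request) for this GPU, or None."""
        if self.queues[gpu]:
            return self.queues[gpu].popleft()
        # Idle GPU: steal from the tail of the longest queue, but only when
        # the backlog is large enough to justify losing cache locality.
        victim = max(range(len(self.queues)), key=lambda g: len(self.queues[g]))
        if len(self.queues[victim]) >= self.steal_threshold:
            return self.queues[victim].pop()
        return None

router = AffinityRouter(num_gpus=2)
for session, req in [("a", "a1"), ("b", "b1"), ("a", "a2"), ("a", "a3")]:
    router.submit(session, req)
print(router.next_request(1))  # ('b', 'b1')
print(router.next_request(1))  # ('a', 'a3'), stolen from GPU 0
```

A stolen request runs without its session's cached prefix, so the threshold is the knob that trades locality against balance. This sketch leaves the session's home unchanged after a steal. A real scheduler would have to decide whether to migrate it.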
This matters because inference now accounts for 70–80% of total GPU cloud spend for production teams. If you can’t co-locate sessions effectively, you’re paying to duplicate prefix computation across GPUs that don’t need to be doing it.
Agent Fair Share Scheduling
The third mechanism targets multi-tenant environments. Traditional fair-share schedulers allocate equal GPU time to each tenant. But in agent workloads, equal time doesn’t mean equal progress. A tenant running simple retrieval-augmented generation tasks completes many tasks per minute. A tenant running complex multi-step coding agents completes few. If you allocate equal GPU time, the coding agent tenant gets starved on task completions.
SAGA introduces Agent Fair Share (AFS), a task-completion-time fairness metric with provable bounded-deviation guarantees. Instead of balancing GPU-milliseconds, it balances task completions. The scheduler ensures no tenant’s task completion rate deviates beyond a bounded factor from what they’d get in a fair allocation. The paper provides formal guarantees on this bound.
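The difference between balancing GPU time and balancing completions shows up in a toy simulation. This is only an illustration of the two objectives, not the paper's AFS algorithm, and it carries none of its formal guarantees. The tenants and per-task costs are invented.

```python
def simulate(task_cost_s, budget_s, key):
    """Run tasks one at a time until the GPU-time budget is exhausted.

    key(tenant, used, done) returns the quantity the scheduler tries to
    equalize; the tenant with the smallest value runs next.
    """
    used = {t: 0.0 for t in task_cost_s}  # GPU seconds consumed per tenant
    done = {t: 0 for t in task_cost_s}    # tasks completed per tenant
    elapsed = 0.0
    while True:
        # Ties are broken by tenant name so the run is deterministic.
        tenant = min(task_cost_s, key=lambda t: (key(t, used, done), t))
        cost = task_cost_s[tenant]
        if elapsed + cost > budget_s:
            break
        elapsed += cost
        used[tenant] += cost
        done[tenant] += 1
    return done

# Hypothetical tenants: GPU seconds needed per task.
costs = {"coding": 10.0, "rag": 1.0}

by_time = simulate(costs, 100.0, lambda t, used, done: used[t])
by_completions = simulate(costs, 100.0, lambda t, used, done: done[t])

print(by_time)         # {'coding': 5, 'rag': 50}
print(by_completions)  # {'coding': 9, 'rag': 9}
```

Equal GPU time gives the coding tenant a tenth of the other tenant's completions. Equalizing completions closes that gap at the cost of fewer total tasks, which is the kind of trade a completion-based fairness metric makes explicit.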
In the evaluation, SAGA achieved 99.2% SLO attainment under multi-tenant interference on the 64-GPU cluster. That’s not a toy benchmark — the workloads are SWE-bench coding agents and WebArena browser tasks, both representing real agent usage patterns.
The 30% Throughput Trade-Off
SAGA is not free. The paper is transparent about the cost: approximately 30% lower peak throughput compared to throughput-optimal batch scheduling. This is the inherent tension between latency-optimized and throughput-optimized serving.
For batch workloads (processing a queue of documents, running evaluations, generating embeddings) you want maximum throughput, and SAGA is not the right tool. For interactive agent deployments where a user is waiting for a response (GitHub Copilot, Amazon Q, enterprise automation), the 1.64x reduction in task completion time matters more than raw throughput. The Scalable Inference Architectures for Compound AI Systems study from April 2026 corroborates this: compound AI systems are dominated by interactive, latency-sensitive deployments.
The practical implication for platform teams: you probably need both. A throughput-optimized pool for batch inference and a latency-optimized pool for interactive agents. SAGA gives you the architecture for the second one. Our earlier analysis of DRA-based K8s AI scheduling addressed the infrastructure layer; SAGA operates above it.
What This Means for Your Infrastructure
If you’re running agent workloads in production, three numbers from this paper should change how you think about GPU allocation:
- 6x latency overhead from request-level scheduling. If your agent feels slow, the bottleneck might not be the model — it might be the scheduler throwing away cache between steps.
- 42% GPU memory utilization. You’re paying for 100% of your GPU memory and using less than half of it effectively. Workflow-aware scheduling pushes this to 71%.
- 1.64x task completion improvement without changing hardware or models. This is pure scheduling efficiency.
The SAGA scheduler is not yet available as a production-ready open-source project, but the design patterns are implementable today. Start by enabling automatic prefix caching (APC) in vLLM, route agent sessions to the same GPU replica when possible, and measure your KV cache hit rates between agent steps. If you see what the paper describes (high regeneration rates and low memory utilization), the workflow-atomic scheduling pattern is worth implementing. Pair this with the techniques from our coverage of speculative decoding for MoE inference for compounded gains.
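For the measurement step, a small helper is enough to get a first signal. The record format is an assumption: one `(prompt_tokens, cached_tokens)` pair per agent step, taken from whatever per-request usage data your serving stack reports.

```python
def cache_hit_rate(steps):
    """Fraction of prompt tokens served from cache after the first step.

    steps: list of (prompt_tokens, cached_tokens) tuples, one per agent step,
    in order. The first step is skipped because it has nothing to reuse.
    """
    later = steps[1:]
    prompt = sum(p for p, _ in later)
    cached = sum(c for _, c in later)
    return cached / prompt if prompt else 0.0

# Hypothetical session: the cache was evicted before the third step.
session = [(4000, 0), (4500, 4000), (5000, 0), (5500, 5000)]
print(f"{cache_hit_rate(session):.0%}")  # 60%
```

A rate well below the prefix overlap you expect within a session suggests cache is being lost during tool pauses, which is the pattern the paper describes.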
The Kubernetes AI infrastructure landscape in 2026 is converging on DRA-based GPU scheduling and Kueue-based quota management. But these solve the infrastructure layer. SAGA operates at the workload layer — understanding that agent steps are not independent requests but stages of a single program. That’s the abstraction shift that matters.
References
- SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters — Guo, Wu, Yiu. HPDC ’26, May 2026.
- vLLM Optimization and Tuning Documentation — vLLM Project, 2026.
- AI Inference Cost Economics in 2026: GPU FinOps Playbook — Spheron Network, 2026.
- Scalable Inference Architectures for Compound AI Systems — Production Deployment Study, April 2026.
- Kubernetes AI Infrastructure in 2026: GPU Scheduling & Production Realities — CloudOptimo, 2026.
- KV Cache Optimization: Memory Efficiency for Production LLMs — Introl, 2026.