A single Llama 3.1 70B request at 128K context consumes roughly 42.9 GB of GPU VRAM just for its KV cache at BF16 precision.
That’s more than half an H100 80GB, before you account for model weights, activation memory, or the fact that you’d like to serve more than one user at a time. The formula is straightforward: 2 (one key tensor and one value tensor) × 80 layers × 8 KV heads × 128 head_dim × 131,072 tokens × 2 bytes ≈ 42.9 GB (decimal gigabytes, exactly 40 GiB). The KV cache is no longer a secondary concern in your inference stack: for long-context serving it is the dominant memory cost, and in 2026 serious production deployments treat it as a first-class engineering problem.
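The arithmetic above can be reproduced in a few lines. This is a minimal sketch that takes the layer count, KV head count, and head dimension exactly as quoted in the text; the function and variable names are illustrative, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem):
    """Bytes of KV cache for one sequence: keys plus values for every layer."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Llama 3.1 70B shape values as quoted in the text above.
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

bf16_128k = kv_cache_bytes(**LLAMA_70B, num_tokens=131_072, bytes_per_elem=2)
per_token = kv_cache_bytes(**LLAMA_70B, num_tokens=1, bytes_per_elem=2)

print(f"128K context, BF16: {bf16_128k / 1e9:.1f} GB ({bf16_128k / 2**30:.0f} GiB)")
print(f"per token, BF16:    {per_token / 1024:.0f} KiB")
```

The per-token figure (320 KiB at BF16 for this model shape) is the number worth remembering: every other footprint in this article is that constant multiplied by tokens, batch size, and a precision factor.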
This isn’t theoretical. Teams running multi-tenant LLM services with long context windows are discovering that their GPU utilization curves are dominated by KV cache allocation, not compute. The question has shifted from “how do I shard this model across GPUs” to “how do I stop the KV cache from eating my entire memory budget.”
Why the KV cache dominates production inference cost
The memory bandwidth wall is the defining constraint of LLM inference in 2026. At low batch sizes — which is where most production chat and RAG workloads live — your GPU isn’t compute-bound. It’s waiting on memory. The autoregressive decode phase is sequential: each new token requires attending to every previous token’s key and value vectors. This makes decode memory-bandwidth bound, not compute-bound. You can’t throw more FLOPS at the problem.
The cache grows linearly with sequence length. A conversation that starts at 2K tokens and extends to 32K tokens increases its KV cache footprint 16x while the model weights stay the same. For multi-turn agents processing documents, codebases, or long RAG context, this is where your bill comes from. A large model at 8K context and batch size 32 can reach tens to hundreds of gigabytes of KV cache memory, on the same order as or larger than the model weights themselves, according to recent inference studies published by Redis.
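To make the scaling concrete, here is a small sketch that reuses the Llama 3.1 70B shape from the opening formula and assumes every sequence in the batch sits at the full context length (a worst case, and an assumption of this example rather than a claim about any particular workload).

```python
# Per-token KV bytes for the Llama 3.1 70B shape used earlier in the article:
# 2 (K and V) x 80 layers x 8 KV heads x 128 head_dim x 2 bytes (BF16).
BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def batch_kv_gb(context_tokens, batch_size):
    """Total KV cache in decimal GB when every sequence is at full context."""
    return BYTES_PER_TOKEN * context_tokens * batch_size / 1e9


growth = batch_kv_gb(32_768, 1) / batch_kv_gb(2_048, 1)
print(f"2K -> 32K growth: {growth:.0f}x")
print(f"8K context, batch 32: {batch_kv_gb(8_192, 32):.1f} GB")
```

Under these assumptions the batch of 32 lands at about 85.9 GB of KV cache, which is already more than a single 80 GB card can hold before any weights are loaded.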
PagedAttention: the table stakes you should already have
vLLM’s PagedAttention is the baseline. If your inference server isn’t using it in 2026, you’re leaving 2-4x throughput on the table. The concept mirrors OS virtual memory: instead of pre-allocating contiguous memory for each request’s maximum possible context length, the KV cache is divided into fixed-size blocks (16 tokens each in vLLM’s default configuration) and allocated on demand as tokens are generated.
Traditional implementations reserve the maximum context window per request upfront: a single Llama 3.1 70B request with a 128K maximum context might claim 42.9 GB of VRAM even if it only uses 4K tokens. PagedAttention eliminates this waste, leaving at most one partially filled block per request. Blocks are allocated incrementally during prefill and decode, and freed immediately when a request completes. The practical result is 2-4x more concurrent requests on the same GPU with zero impact on model quality, as documented in Spheron’s 2026 KV cache engineering guide.
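The allocation pattern described above can be illustrated with a toy block allocator. This is a simplified sketch, not vLLM’s actual implementation: the class name, method names, and data structures are hypothetical, and real engines also track the KV tensors themselves, copy-on-write sharing, and eviction.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, grabbing a new block if needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:  # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=8)
for _ in range(40):                       # a 40-token request
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]))   # blocks held by the request
alloc.free("req-a")
print(len(alloc.free_blocks))             # blocks back in the pool
```

A 40-token request holds three 16-token blocks rather than a reservation sized for the maximum context, and all eight blocks are available again as soon as it finishes.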
Every major inference engine — vLLM, TensorRT-LLM, TGI, SGLang — has adopted some form of paged attention. The question is no longer whether to use it, but what you layer on top.
KV cache quantization: BF16 → FP8 → FP4, and the real savings
Quantizing the KV cache is the fastest path to meaningful memory reduction without architectural changes. The math is brutal in its simplicity: halving the precision halves the memory. A Llama 3.1 70B request at 128K context drops from 42.9 GB at BF16 to roughly 21.5 GB at FP8 and approximately 10.7 GB at FP4 according to Spheron’s VRAM calculations.
FP8 KV cache quantization is production-ready on H100 and A100 via vLLM’s --kv-cache-dtype fp8 flag. Note that A100 has no native FP8 tensor cores, so on Ampere the benefit is the smaller cache footprint rather than faster attention math. The quality impact is typically negligible for most tasks (less than 1% degradation on standard benchmarks at FP8) because key and value vectors are less sensitive to precision reduction than weights or activations.
NVFP4, NVIDIA’s 4-bit floating-point format, is the new frontier, but it’s Blackwell-only. The B200, B300, RTX 5090, and RTX PRO 6000 support hardware-accelerated NVFP4 operations. On Hopper (H100), you can load NVFP4 model weights via Marlin’s software fallback, but KV cache FP4 is not hardware-accelerated, so you lose the throughput advantage. The right move: use --kv-cache-dtype fp8 on H100/A100, and --kv-cache-dtype nvfp4 on Blackwell.
A recent arXiv survey on KV cache optimization strategies covers advanced quantization techniques including per-channel key quantization, pre-RoPE quantization, and sensitivity-weighted non-uniform quantization — methods that squeeze additional efficiency by exploiting the statistical properties of key vs. value vectors differently. KIVI, for instance, applies per-channel quantization to the key cache and per-token quantization to the value cache, maintaining a small FP16 buffer to preserve critical attention patterns.
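The difference between per-channel and per-token grouping is easier to see on a small example. The sketch below is a pure-Python toy, not KIVI’s implementation: it uses plain min/max asymmetric quantization at 4 bits, omits the FP16 residual buffer, and runs on a made-up key matrix in which one channel carries much larger values than the others (an assumption chosen to show why the grouping axis matters).

```python
def quantize_group(values, bits=4):
    """Asymmetric uniform quantization of one group; returns dequantized values."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 on constant groups
    return [round((v - lo) / scale) * scale + lo for v in values]


def quantize_per_token(matrix, bits=4):
    """One scale per row (token): the grouping described for the value cache."""
    return [quantize_group(row, bits) for row in matrix]


def quantize_per_channel(matrix, bits=4):
    """One scale per column (channel): the grouping described for the key cache."""
    columns = [quantize_group(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def max_abs_error(a, b):
    """Largest elementwise reconstruction error between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical key matrix (tokens x channels); channel 0 is an outlier channel.
keys = [[50.0, 0.1, -0.2], [52.0, 0.3, 0.1], [49.0, -0.1, 0.2], [51.0, 0.2, -0.3]]

print("per-token error:  ", max_abs_error(keys, quantize_per_token(keys)))
print("per-channel error:", max_abs_error(keys, quantize_per_channel(keys)))
```

On this toy input the per-channel error is far smaller, because the outlier channel gets its own scale instead of stretching the quantization range of every token it appears in.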
Prefix caching: where the real ROI lives for RAG and agents
If your workload involves repeated system prompts, shared document context, or multi-turn conversations, prefix caching is your highest-leverage optimization. Instead of recomputing the KV cache for identical prefix tokens on every request, the system stores and reuses cached KV blocks.
vLLM has built-in automatic prefix caching. LMCache extends this further by persisting cached KV blocks across server sessions, restarts, and multiple server instances using Redis or disk backends. The numbers are compelling: on a 128K-token system prompt running on H100, LMCache demonstrated a reduction in time-to-first-token (TTFT) from 11 seconds to 1.5 seconds by eliminating the redundant prefill computation.
The pattern is clear: for RAG pipelines where dozens of users query the same document corpus, or agentic systems with long system prompts, prefix caching turns your most expensive operation — prefill — into a memory lookup. Production teams report 85-95% prefix hit rates on shared-context workloads, which translates directly to lower GPU hours and reduced latency. Digital Applied’s engineering guide notes that every 2026 production inference stack has paged attention by default — the real differentiator is the caching layer on top.
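One common way to implement block-level prefix reuse is to hash each block of tokens together with the hash of everything before it, so that a lookup hit guarantees the whole prefix matches. The sketch below illustrates that idea under stated assumptions: the class and function names are hypothetical, the cached value is a placeholder string rather than real KV tensors, and this is not the vLLM or LMCache code.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block, matching the vLLM default quoted earlier


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest  # chaining makes a hash depend on the whole prefix
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> cached KV block (a placeholder here)

    def match(self, token_ids):
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            hit += BLOCK_SIZE
        return hit

    def insert(self, token_ids):
        for h in block_hashes(token_ids):
            self.blocks.setdefault(h, "kv-block")


system_prompt = list(range(64))            # hypothetical shared 64-token prefix
cache = PrefixCache()
cache.insert(system_prompt + [900, 901, 902])
reused = cache.match(system_prompt + [500, 501])
print(reused)   # tokens that can skip prefill
```

The second request shares only the system prompt with the first, yet all 64 prefix tokens are served from the cache; only the trailing partial block needs prefill.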
CPU offloading and hybrid memory: when you’re VRAM-constrained
Not everyone has Blackwell GPUs. For teams running on A100 80GB or consumer hardware, CPU offloading provides a viable escape hatch. The approach is conceptually simple: store less frequently accessed KV cache pages in system RAM and stream them to GPU VRAM on demand via PCIe or NVLink. The latency penalty is real — system RAM bandwidth is 10-100x lower than HBM — but for low-concurrency, long-context workloads (think single-user coding assistants or batch document processing), it’s often acceptable.
The trade-off curve is steep. On an A100 80GB serving Llama 3.1 70B at moderate context lengths (up to ~32K), FP8 KV quantization combined with PagedAttention can keep everything in VRAM for small batch sizes, provided the weights themselves are quantized to fit on the card (at BF16 they are roughly 140 GB). But pushing to 128K context requires either multi-GPU tensor parallelism or CPU offloading. The decision hinges on your latency budget: if TTFT under 2 seconds is a requirement, stay in VRAM. If you can tolerate 5-10 second TTFT for a batch job, CPU offloading with NVMe-backed swap is cost-effective.
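The offloading idea can be sketched as a two-tier store in which the least recently used blocks spill from a small "GPU" tier to a larger "CPU" tier and are copied back on access. This is an illustrative toy with hypothetical names: real engines move tensors over PCIe asynchronously and use more elaborate eviction policies, and the LRU policy here is an assumption of this example.

```python
from collections import OrderedDict


class TwoTierKVStore:
    """Toy GPU/CPU block store: least-recently-used blocks spill to host RAM."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block id -> data, ordered oldest to newest
        self.cpu = {}
        self.transfers = 0         # host-to-device copies, the expensive path

    def put(self, block_id, data):
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        self._evict()

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)      # mark as recently used
            return self.gpu[block_id]
        data = self.cpu.pop(block_id)           # KeyError if never stored
        self.transfers += 1                     # simulated PCIe copy
        self.put(block_id, data)
        return data

    def _evict(self):
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # oldest entry
            self.cpu[victim] = data


store = TwoTierKVStore(gpu_capacity=2)
for i in range(4):
    store.put(i, f"kv-{i}")
store.get(0)   # block 0 was spilled, so this triggers one transfer
print(sorted(store.gpu), sorted(store.cpu), store.transfers)
```

Counting transfers is the useful part: each one stands in for a copy across a link far slower than HBM, so a workload that keeps touching spilled blocks pays the latency penalty on every access.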
Semantic caching: cutting costs before the model runs
Semantic caching attacks the problem from a different angle. Instead of optimizing how the KV cache is stored, it eliminates the need to compute it at all for semantically similar queries. Redis’s analysis of production LLM workloads notes that a meaningful portion of queries are semantically similar to ones already answered — you’re paying for computations you’ve already performed.
Semantic caching systems like Redis LangCache use embedding similarity to match incoming queries against cached responses, returning cached results when the similarity score exceeds a threshold. For customer support bots, FAQ systems, and any workload with repetitive query patterns, this can reduce inference costs by 30-60% with minimal quality impact. The key engineering decision is the similarity threshold: too aggressive, and you return stale or irrelevant responses; too conservative, and your cache hit rate collapses. Production systems typically tune this per use case, with stricter thresholds for factual Q&A and more aggressive caching for conversational chitchat.
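The threshold trade-off can be shown with a minimal in-memory cache. This sketch is not the Redis LangCache API: the class, its methods, and the three-dimensional "embeddings" are hypothetical stand-ins, and a real deployment would use an embedding model and a vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def lookup(self, embedding):
        """Return the best cached response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        self.entries.append((embedding, response))


cache = SemanticCache(threshold=0.95)
cache.store([0.9, 0.1, 0.0], "Reset it from the account settings page.")
print(cache.lookup([0.88, 0.12, 0.01]))   # near-duplicate query: cache hit
print(cache.lookup([0.1, 0.2, 0.95]))     # unrelated query: miss, returns None
```

Lowering `threshold` increases the hit rate at the risk of returning an answer to a different question, which is why stricter values suit factual Q&A.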
What the 2026 production stack looks like
Based on current tooling and hardware availability, a well-optimized production inference stack in 2026 layers these techniques in this order:
- PagedAttention — mandatory baseline via vLLM, TensorRT-LLM, or SGLang. Eliminates pre-allocation waste.
- FP8 KV quantization on Hopper, NVFP4 on Blackwell. Halves or quarters your KV cache memory with negligible quality loss.
- Prefix caching for any workload with shared context. LMCache for cross-session persistence. This is where the TTFT improvements live.
- Semantic caching at the application layer for repetitive query patterns. Cuts total inference volume, not just per-request cost.
- CPU offloading only as a last resort for VRAM-constrained environments with relaxed latency requirements.
Frequently Asked Questions
How much GPU memory does KV cache use for Llama 3.1 70B at 128K context?
At BF16 precision, a single request consumes approximately 42.9 GB for the KV cache alone. With FP8 quantization this drops to roughly 21.5 GB, and with NVFP4 on Blackwell hardware to about 10.7 GB. On H100 80GB hardware, serving multiple concurrent users at 128K context requires aggressive quantization plus a multi-GPU setup.
What is PagedAttention and does it affect model quality?
PagedAttention applies OS-style virtual memory paging to the KV cache, allocating fixed-size blocks on-demand instead of pre-reserving the maximum context length. It delivers 2-4x more concurrent requests on the same GPU with zero impact on model quality — the same tokens are computed, just stored more efficiently.
When should I use CPU offloading vs. prefix caching?
Use prefix caching when your workload has repeated system prompts or shared document context — it eliminates redundant computation entirely. CPU offloading is a last resort for VRAM-constrained deployments (like A100 80GB with 70B models at long context) where you can tolerate higher TTFT. If your latency budget is under 2 seconds, avoid CPU offloading.
Is NVFP4 KV cache quantization production-ready?
Yes, on Blackwell GPUs (B200, B300, RTX 5090). On H100 and A100, NVFP4 KV cache operations are not hardware-accelerated, so you lose the throughput advantage. Use FP8 quantization on those GPUs instead: it halves KV cache memory with negligible quality loss.
References
- KV Cache Optimization Strategies for Scalable and Efficient LLM Inference — arXiv 2603.20397
- KV Cache Optimization: Serve 10x More Users on the Same GPU (2026) — Spheron
- How to Optimize Machine Learning Inference Costs and Performance — Redis
- KV Cache Optimization for LLMs 2026: Engineering Guide — Digital Applied
- Fastest LLM Inference (2026): GPU Speed vs Cost Per Token — Yotta Labs