Google TPU v8 Puts KV Cache on Silicon to Cut Inference Cost

Google Put KV Cache on Silicon

Google’s TPU 8i triples on-chip SRAM to 384 MB and crams 288 GB of HBM onto a single chip — enough to host massive KV caches entirely in silicon, bypassing the memory wall that has bottlenecked LLM inference since the transformer era began. The result: 80% better performance-per-dollar for inference compared to the previous TPU generation, and a dedicated Collectives Acceleration Engine that cuts on-chip latency by up to 5x.

This isn’t an incremental spec bump. Google bifurcated its entire TPU line (TPU 8t for training, TPU 8i for inference) because agentic workloads demand fundamentally different hardware profiles than batch training. If you’re running LLM inference at scale, this architecture decision deserves your attention.

Two Chips, Two Problems

For seven generations, Google built general-purpose TPUs that tried to serve both training and inference. TPU v8 abandons that compromise. The training-focused TPU 8t packs 9,600 chips into a single superpod, delivering 121 exaflops of compute and two petabytes of shared memory via high-speed inter-chip interconnect (ICI). The inference-focused TPU 8i takes a different path: massive on-chip memory to keep KV caches local, ICI bandwidth doubled to 19.2 Tb/s, and a network diameter reduced by more than 50%.

Why split? Because training and inference have opposite optimization targets. Training wants maximum FLOPs across distributed nodes with elastic batching. Inference — especially agentic inference — wants minimum latency per token across long context windows with unpredictable request patterns. A single architecture can’t serve both well.

As Google’s Amin Vahdat and Mark Lohmeyer noted, the agentic era means “a single intent triggers a chain reaction” where primary agents decompose goals into tasks for specialized fleets that “collaborate, preserve state, and use reinforcement learning to deliver outcomes in real-time.” That workload profile makes memory bandwidth, not compute, the binding constraint.

The KV Cache Memory Wall

Anyone who has operated LLM inference at scale knows the drill: KV cache grows linearly with both sequence length and batch size, so doubling either one doubles the footprint. A single request with a 128K context window can consume gigabytes of memory. When you’re running thousands of concurrent agentic sessions, each maintaining state across multi-turn tool calls, KV cache becomes the dominant cost driver, often exceeding 70% of GPU memory allocation.
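To make the "gigabytes per request" figure concrete, here is a minimal sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128, FP16 entries) is a hypothetical example chosen for illustration, not a configuration named in this article, and the function name `kv_cache_bytes` is likewise invented here.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes of KV cache for a batch of equal-length sequences."""
    # Factor of 2: each layer stores one key vector and one value vector
    # per KV head for every token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape (an assumption, not taken from the article).
MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

one_request = kv_cache_bytes(seq_len=128 * 1024, batch_size=1, **MODEL)
print(f"{one_request / 2**30:.1f} GiB for one 128K-token request")  # 16.0 GiB
```

Under these assumptions each token costs 128 KiB, so a single 128K-token request holds 16 GiB of cache, and the total scales in direct proportion to both sequence length and batch size.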

The standard mitigation has been software-level: PagedAttention (vLLM), RadixAttention (SGLang), KV cache quantization, and prefix caching. These help, but they’re workarounds for a fundamental hardware limitation: the memory bandwidth wall between HBM and compute cores.
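For readers unfamiliar with the paged approach, the idea can be sketched as a block table that maps each request to fixed-size physical blocks, allocated only as tokens arrive. This is a toy illustration of the concept, not vLLM's implementation; the class `PagedKVAllocator` and its methods are hypothetical names.

```python
class PagedKVAllocator:
    """Toy block-table allocator in the spirit of paged KV caching."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.token_counts = {}  # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (block id, slot index)."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.token_counts.get(request_id, 0)
        slot = count % self.block_size
        if slot == 0:
            # Either the first token or the last block is full: grab a new block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1
        return table[-1], slot

    def release(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]))  # 2 blocks cover 20 tokens
```

Paging removes fragmentation and over-reservation, but every block still lives in HBM, which is why the bandwidth limit remains.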

TPU 8i attacks the problem at the silicon level. By tripling on-chip SRAM to 384 MB, Google creates enough fast memory to keep active KV cache entries on-chip for typical inference patterns, avoiding repeated round-trips to HBM. The dedicated Collectives Acceleration Engine (CAE) handles the synchronization primitives that distributed inference requires, reducing the overhead of multi-chip KV cache lookups.
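A rough capacity check helps put "active KV cache entries" in proportion. The sketch below uses the article's 384 MB and 288 GB figures (read as decimal units, which is an assumption) together with a hypothetical per-token KV cost for a 32-layer, 8-KV-head, FP16 model. It also pretends the whole SRAM is available for KV data, which would not hold in practice.

```python
SRAM_BYTES = 384 * 10**6  # on-chip SRAM figure from the article (decimal MB assumed)
HBM_BYTES = 288 * 10**9   # HBM figure from the article (decimal GB assumed)

# Hypothetical model: 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2


def tokens_that_fit(capacity_bytes, kv_bytes_per_token):
    """Whole tokens of KV cache that fit in a memory tier of the given size."""
    return capacity_bytes // kv_bytes_per_token


print("SRAM:", tokens_that_fit(SRAM_BYTES, KV_BYTES_PER_TOKEN), "tokens")
print("HBM: ", tokens_that_fit(HBM_BYTES, KV_BYTES_PER_TOKEN), "tokens")
```

For this hypothetical model the SRAM holds a few thousand tokens while HBM holds a few million, which suggests the on-chip tier serves the hot slice of the cache rather than entire long contexts.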

KV Cache: Hardware vs. Software Optimization

Approach               | Mechanism                           | Limitation
-----------------------|-------------------------------------|------------------------------------------
PagedAttention (vLLM)  | Virtual memory for KV blocks        | Still bounded by HBM bandwidth
KV Cache Quantization  | Compress cache entries (FP8/INT4)   | Quality degradation at scale
Prefix Caching         | Reuse shared prompt prefixes        | Only helps with repeated system prompts
TPU 8i On-Chip SRAM    | Host active KV cache in 384 MB SRAM | Requires TPU infrastructure; not portable
CAE Acceleration       | Hardware-synced distributed cache   | Locked to Google’s interconnect fabric

What Agentic Inference Actually Demands

The Nutanix 2026 Enterprise Cloud Index identifies a key shift: enterprises are moving from “AI-first” (ship fast, govern later) to “AI-smart” (reliability over speed). Agentic workflows are the pressure point. A single agent call spawns multiple sub-agent calls, each with its own context window, each requiring its own KV cache allocation.

Consider a production scenario: an AI SRE agent that correlates telemetry, investigates incidents, and executes bounded remediation. One user intent — “diagnose the latency spike in us-east-1” — might trigger:

  • A log-analysis agent scanning 10M+ entries (long context window)
  • A metrics agent querying time-series data across 50+ services
  • A topology agent traversing service dependency graphs
  • A remediation agent proposing and executing fixes

Each agent maintains its own KV cache. The total memory footprint for a single incident investigation can easily exceed what a single GPU can provide — and the latency requirements (sub-second token generation) mean you can’t afford to spill to host memory or reconstruct contexts on the fly.
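The fan-out can be tallied with a few lines of arithmetic. Every number below is an illustrative assumption: the per-agent context lengths, the 128 KiB per-token KV cost, and the 16 GiB device budget are placeholders, not measurements or published specifications.

```python
def fanout_kv_bytes(context_tokens_by_agent, kv_bytes_per_token):
    """Per-agent KV cache bytes for one user intent that fans out to sub-agents."""
    return {agent: tokens * kv_bytes_per_token
            for agent, tokens in context_tokens_by_agent.items()}


# Hypothetical context sizes for the four agents in the incident scenario.
contexts = {
    "log_analysis": 128_000,
    "metrics": 32_000,
    "topology": 16_000,
    "remediation": 8_000,
}
KV_BYTES_PER_TOKEN = 128 * 1024  # assumed per-token cost for a mid-sized model
DEVICE_BUDGET = 16 * 2**30       # assumed KV budget on one accelerator

per_agent = fanout_kv_bytes(contexts, KV_BYTES_PER_TOKEN)
total = sum(per_agent.values())
print(f"total: {total / 2**30:.2f} GiB, fits on one device: {total <= DEVICE_BUDGET}")
```

With these placeholder values a single investigation needs roughly 22 GiB of KV cache, more than the assumed single-device budget, and that is before any other user's session is counted.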

This is exactly the workload TPU 8i was designed for: many concurrent, long-context, latency-sensitive inference requests that share underlying infrastructure. The 80% performance-per-dollar improvement isn’t from faster chips — it’s from keeping more of the working set in the fastest memory tier.
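The effect of the memory tier on decode time can be approximated with a simple two-tier read model: each generated token has to read the cached keys and values, and the time spent depends on how much of that read is served from the fast tier. The bandwidth values here are placeholders chosen only to show the shape of the curve, not published TPU specifications, and the model ignores overlap between compute and memory traffic.

```python
def kv_read_seconds(kv_bytes, fast_hit_fraction, fast_bw, slow_bw):
    """Time to read kv_bytes when a fraction is served from the fast tier.

    Bandwidths are in bytes per second.
    """
    if not 0.0 <= fast_hit_fraction <= 1.0:
        raise ValueError("fast_hit_fraction must be between 0 and 1")
    fast_bytes = kv_bytes * fast_hit_fraction
    slow_bytes = kv_bytes - fast_bytes
    return fast_bytes / fast_bw + slow_bytes / slow_bw


FAST_BW = 10e12  # placeholder on-chip bandwidth, bytes/s (assumption)
SLOW_BW = 1e12   # placeholder off-chip bandwidth, bytes/s (assumption)
KV_BYTES = 1e9   # KV data touched per decode step (assumption)

for hit in (0.0, 0.5, 0.9):
    ms = kv_read_seconds(KV_BYTES, hit, FAST_BW, SLOW_BW) * 1e3
    print(f"fast-tier hit fraction {hit:.1f}: {ms:.3f} ms per step")
```

The read time falls as the fast-tier share rises even though no arithmetic unit got faster, which is the mechanism the performance-per-dollar claim rests on.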

The Vendor Lock-In Question

Here’s the uncomfortable engineering trade-off. Google’s approach is technically elegant but architecturally binding. The TPU 8i’s advantages depend on the entire stack: the ICI fabric, the CAE, the Pathways orchestrator, and Google’s custom JAX/PyTorch integration. You can’t replicate this with off-the-shelf NVIDIA hardware.

Google is also offering A5X bare metal instances with NVIDIA Vera Rubin NVL72 for customers who want NVIDIA’s stack, plus Axion N4A VMs on custom Arm CPUs for inference workloads that don’t need TPU-level performance. The message: use our best hardware (TPU 8i) for your most demanding inference, but we won’t force you off NVIDIA entirely.

The practical question for infrastructure teams isn’t whether TPU 8i is faster — the specs suggest it is. The question is whether the performance gain justifies committing to Google’s inference stack for the next 3-5 years, given that model specialization across providers is accelerating. Anthropic optimizes for coding. Google for general-purpose. Amazon for data-sovereign workloads. Your inference infrastructure needs to support all of them.

What Engineers Should Do Now

If you’re operating LLM inference in production today, here’s how to contextualize TPU v8 against your actual infrastructure decisions:

  1. Profile your KV cache pressure first. Before evaluating hardware changes, measure what percentage of your inference latency comes from KV cache reads versus compute. If it’s under 40%, TPU 8i’s SRAM advantage is marginal for your workload. If it’s over 60%, start benchmarking.
  2. Separate training and inference infrastructure. Google’s bifurcation validates what many teams already suspected: the optimal hardware for training and inference is fundamentally different. Stop sharing GPU pools between training and serving — the scheduling conflicts and memory fragmentation alone cost you 20-30% utilization.
  3. Design for model portability. The Pulumi 2026 predictions argue that “you’ll need infrastructure that’s model-agnostic and supports multiple AI backends.” This is correct. Build an abstraction layer (LiteLLM, OpenRouter, or a custom router) that lets you route inference requests to the optimal backend per model. TPU 8i for Gemini workloads, H100/B200 for OpenAI/Anthropic, local inference for low-latency edge cases.
  4. Watch the competing silicon. The same week Google announced TPU v8, AWS announced Trainium3 UltraServers with 3nm chips delivering 4.4x more compute than the previous generation, plus AI Factories for on-premises deployment. The inference hardware market is heating up fast, so don’t commit to a single vendor before the next hardware cycle.
  5. Measure total cost of inference, not per-token price. Google claims 80% better performance-per-dollar, but that’s against their own previous generation on their own workloads. Your cost equation includes network egress, data residency requirements, cold-start latency for multi-region deployments, and the engineering cost of maintaining a heterogeneous inference fleet.
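Steps 1 and 3 above can be sketched in code. The first function applies the 40% and 60% thresholds from step 1 to a measured latency breakdown; the second is a bare-bones routing table of the kind step 3 describes. The function names, backend pool names, and model-name prefixes are all hypothetical, and a real deployment would more likely delegate routing to a library such as LiteLLM than hand-roll it.

```python
def kv_pressure_verdict(kv_read_ms, total_ms):
    """Classify a workload by the share of latency spent on KV cache reads."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    share = kv_read_ms / total_ms
    if share < 0.40:
        return "marginal"      # SRAM advantage unlikely to matter much
    if share > 0.60:
        return "benchmark"     # worth benchmarking on inference-tuned hardware
    return "inconclusive"      # between the two thresholds


# Hypothetical mapping from model-name prefix to a backend pool.
ROUTES = {
    "gemini": "tpu-8i",
    "gpt": "nvidia-gpu",
    "claude": "nvidia-gpu",
}


def route(model_name, edge=False, default="nvidia-gpu"):
    """Pick a backend pool for a request; edge traffic stays local."""
    if edge:
        return "local"
    for prefix, backend in ROUTES.items():
        if model_name.lower().startswith(prefix):
            return backend
    return default


print(kv_pressure_verdict(kv_read_ms=70, total_ms=100))  # benchmark
print(route("gemini-example"))                           # tpu-8i
```

Keeping the routing table as data rather than code means a new backend can be added, or a vendor dropped, without touching the request path.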

The Bigger Picture

The TPU v8 split signals something important about where AI infrastructure is heading. The era of general-purpose AI accelerators is ending. Training, inference, and edge AI have divergent enough requirements that purpose-built silicon wins. Google built TPU 8t for one job and TPU 8i for another because the compromise was leaving too much performance on the table.

For engineers running production AI systems, the actionable insight isn’t “switch to TPU 8i.” It’s that the memory hierarchy for LLM inference is being redesigned from the silicon up. KV cache — the operational bottleneck every inference team fights — now has hardware dedicated to solving it. That changes the economics of long-context, multi-agent workloads in ways that software optimization alone never could.

The teams that will benefit most are those already running agentic inference at scale and hitting the KV cache wall daily. For everyone else, the play is to build portable inference infrastructure that can adopt purpose-built hardware as it becomes available — without being locked into a single vendor’s stack.
