Inference Is Two-Thirds of AI Compute
Deloitte’s 2026 TMT Predictions estimates inference accounts for roughly two-thirds of all AI compute this year, a structural shift that has fractured the cloud market into six distinct provider categories. NVIDIA’s Blackwell and GB200 architectures have flooded the market with new GPU options, and the question “which cloud for my AI workload” no longer has a three-hyperscaler answer. Platform teams now face dozens of sub-decisions across training, fine-tuning, real-time serving, and agentic orchestration.
The Six AI Cloud Categories
A working taxonomy has emerged from the fragmentation. Each category optimizes for a different workload profile, and most production teams end up spanning at least two or three.
| Category | Key Providers | Strength | Weakness | Best For |
|---|---|---|---|---|
| Traditional Hyperscalers | AWS, Azure, GCP, Oracle | Ecosystem depth, compliance | Higher per-GPU cost, slow provisioning | Regulated enterprise, hybrid |
| Neoclouds | CoreWeave, Lambda, Nebius, Crusoe | Bare-metal GPU perf, K8s-native | Limited non-GPU services | Frontier training, large-scale inference |
| Developer Clouds | DigitalOcean, Vultr, Hyperstack | Transparent pricing, fast onboarding | Smaller GPU fleets | Prototyping, mid-market |
| Inference Platforms | Fireworks, Groq, Cerebras, Together | Ultra-low latency, purpose-built serving | No training, model constraints | Real-time agents, chatbots |
| GPU Marketplaces | Vast.ai, TensorDock, RunPod | Lowest hourly prices, broad hardware | Variable reliability, weak SLAs | Experimentation, batch work |
| Orchestration Layers | BentoML, SkyPilot, Anyscale | Provider-agnostic, portable | Abstraction overhead | Multi-cloud, workload migration |
Neoclouds Are the Breakout Category
The backlog figures show the scale. CoreWeave went public in March 2025 and reported a contracted revenue backlog of $66.8 billion as of December 2025, up from $55.6 billion one quarter earlier. NVIDIA invested $2 billion in January 2026 to accelerate CoreWeave’s buildout toward 5 GW of AI factories by 2030. Lambda Labs closed a $1.5 billion round in late 2025 and is deploying GB300 NVL72 systems into Azure through a Microsoft partnership.
Nebius signed a $3 billion deal with Meta in February 2026 and targets 800 MW to 1 GW of connected capacity by year-end. Its contracted backlog is growing faster than it can provision hardware: CEO Arkady Volozh noted that demand for Blackwell capacity exceeded supply and that deal sizes were limited only by available infrastructure.
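As a quick check on the CoreWeave figures, the quarter-over-quarter change in backlog can be worked out directly. This is a minimal sketch; the function and variable names are illustrative, and the only inputs are the two backlog figures quoted in the text.

```python
def quarterly_growth(previous: float, current: float) -> float:
    """Return the fractional change from one quarter to the next."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


# Backlog figures quoted in the text, in billions of US dollars.
backlog_prev_quarter = 55.6
backlog_dec_2025 = 66.8

growth = quarterly_growth(backlog_prev_quarter, backlog_dec_2025)
print(f"Quarter-over-quarter backlog growth: {growth:.1%}")
```

That works out to roughly 20% growth in a single quarter.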
SemiAnalysis ClusterMAX 2.0 ratings put Nebius, Oracle, Azure, and CoreWeave in the top gold tier for infrastructure quality, a validation that neoclouds now compete directly with hyperscalers on reliability — not just price.
Oracle and the Stargate Factor
Oracle’s positioning deserves special attention because it breaks the assumption that only NVIDIA-native clouds win at scale. The Stargate Initiative, a joint venture with OpenAI and SoftBank announced in January 2025, plans nearly 7 gigawatts of AI data center capacity across multiple US sites. The flagship campus in Abilene, Texas is already operational, deploying up to 450,000 GB200 superchips.
OCI Superclusters scale to 131,072 NVIDIA GPUs per cluster as of mid-2025, underpinned by the Zettascale10 architecture. Oracle’s strategy is essentially “become the physical layer for frontier model builders” — and OpenAI is both the anchor tenant and co-investor. AWS is expected to approach $200 billion in 2026 CapEx, but Oracle’s focused GPU density per cluster is hard to match.
Google’s Silicon Gambit: TPU v8
At Google Cloud Next ’26 in April, Google introduced two distinct TPU v8 chips optimized for different workloads. The TPU 8t packs 9,600 chips in a single superpod delivering 121 exaflops and 2 petabytes of shared memory, designed for high-throughput training. The TPU 8i triples on-chip SRAM to 384 MB and increases HBM to 288 GB to host massive KV caches entirely on silicon — a direct answer to the inference bottleneck.
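To see why HBM capacity matters for serving, it helps to estimate how many tokens of KV cache fit in 288 GB. The sketch below uses the standard per-token KV cache size for a transformer (keys plus values, per layer, per KV head). The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example rather than any specific Google or Gemini configuration, and the 288 GB figure is assumed to be decimal gigabytes with nothing reserved for weights.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(hbm_bytes: int, reserved_bytes: int, per_token: int) -> int:
    """Whole tokens of KV cache that fit after reserving space (e.g. weights)."""
    usable = hbm_bytes - reserved_bytes
    if usable < 0:
        raise ValueError("reserved_bytes exceeds hbm_bytes")
    return usable // per_token


# 288 GB of HBM per the text; decimal gigabytes are an assumption.
HBM_BYTES = 288 * 10**9

# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)
max_tokens = tokens_that_fit(HBM_BYTES, reserved_bytes=0, per_token=per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in HBM: {max_tokens:,}")
```

In practice the weights share the same memory, so passing their size as `reserved_bytes` gives a more realistic ceiling on concurrent context.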
The 8i also doubles ICI inter-chip bandwidth to 19.2 Tb/s and includes a Collectives Acceleration Engine that cuts on-chip latency by up to 5x. Google claims 80% better performance per dollar for inference compared to the prior generation. For teams already running on GKE, the TPU 8i is positioned as the cost-optimal path for high-volume Gemini or open-model inference without exposure to NVIDIA’s pricing cycle.
Google also announced A5X bare-metal instances powered by NVIDIA’s upcoming Vera Rubin NVL72 platform and said it is co-engineering the open-source Falcon networking protocol with NVIDIA through the Open Compute Project.
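A claim of "80% better performance per dollar" is easy to misread as an 80% cost cut. A minimal sketch of the conversion, assuming the claim means 1.8x as much work per dollar at an unchanged workload (the function name is illustrative):

```python
def relative_cost_per_unit(perf_per_dollar_gain: float) -> float:
    """Cost per unit of work relative to baseline.

    perf_per_dollar_gain is fractional: 0.80 means 80% more work per dollar.
    """
    if perf_per_dollar_gain <= -1.0:
        raise ValueError("gain must be greater than -1")
    return 1.0 / (1.0 + perf_per_dollar_gain)


gain = 0.80  # Google's claimed improvement, per the text
relative_cost = relative_cost_per_unit(gain)
savings = 1.0 - relative_cost

print(f"Cost per unit of work: {relative_cost:.1%} of baseline")
print(f"Implied savings at constant volume: {savings:.1%}")
```

Under that reading, the same inference volume would cost about 56% of the prior-generation bill, a saving of roughly 44% rather than 80%.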
CNCF Conformance and llm-d
The fragmentation problem has a standards answer. The CNCF launched its Kubernetes AI Conformance Program in November 2025, and by KubeCon EU in March 2026 it had nearly doubled certified platforms from 18 to 31. The program now validates agentic workloads and mandates alignment with Kubernetes v1.35 technical primitives.
The more significant development is llm-d, accepted as a CNCF Sandbox project in March 2026. Founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA, llm-d treats distributed LLM inference as a first-class cloud-native workload, with an explicit goal of any model, any accelerator, any cloud. It introduces a DisaggregatedSet operator that separates model weights, KV cache, and compute into independently scalable Kubernetes resources.
Google is aligning GKE’s inference gateway with llm-d’s disaggregated serving model. If llm-d matures, it becomes the abstraction layer that makes the six-category taxonomy manageable: you write your serving logic once and target any certified platform. For teams already scheduling AI workloads on Kubernetes with Dynamic Resource Allocation (DRA), this is the next layer up.
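The value of disaggregation is that each resource pool can be sized from its own demand rather than scaled in lockstep. The sketch below illustrates that idea only; it is not llm-d's API. The `Pool` class, the pool names, and all demand and capacity numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """A hypothetical independently scalable resource pool."""
    name: str
    demand: float                # load in the pool's own unit
    capacity_per_replica: float  # what one replica handles, same unit


def replicas_needed(pool: Pool) -> int:
    """Smallest replica count that covers demand, with a floor of one."""
    if pool.capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    return max(1, math.ceil(pool.demand / pool.capacity_per_replica))


def scaling_plan(pools: list[Pool]) -> dict[str, int]:
    """Size every pool on its own demand, independent of the others."""
    return {pool.name: replicas_needed(pool) for pool in pools}


# Illustrative numbers only; the units differ per pool on purpose.
pools = [
    Pool("weights", demand=3, capacity_per_replica=1),          # model shards
    Pool("kv_cache", demand=900, capacity_per_replica=288),     # GB of cache
    Pool("compute", demand=12000, capacity_per_replica=2500),   # tokens/s
]

plan = scaling_plan(pools)
print(plan)
```

A spike in cache demand here grows only the `kv_cache` pool, which is the behavior a monolithic serving replica cannot offer.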
What This Means for Platform Teams
Three concrete implications for engineers making infrastructure decisions in the second half of 2026:
- Stop treating inference as a subset of training infrastructure. The workload profiles are fundamentally different. Inference needs always-on, low-latency, auto-scaling serving — not batch-oriented, checkpoint-restart training jobs. The inference-optimized platforms (Groq, Cerebras, Fireworks) exist because hyperscaler GPU instances are often overprovisioned for serving.
- Evaluate neoclouds for dedicated capacity, not just spot pricing. The quality gap has closed. With SemiAnalysis ratings putting neoclouds in gold tier alongside Oracle and Azure, the decision is now about contract structure, data gravity, and compliance rather than reliability. CoreWeave’s $66.8B backlog suggests enterprise buyers have already made this shift.
- Build for portability with llm-d as the target abstraction. The CNCF conformance program plus llm-d’s disaggregated serving model means Kubernetes-native AI deployment is becoming a reality. Teams that invest in deployment speed and portability today will avoid the vendor lock-in pain that early GPU adopters experienced during the Hopper-to-Blackwell transition.
The Decision Framework
The practical heuristic most teams are converging on: use a hyperscaler for data gravity and compliance-bound workloads, a neocloud or inference platform for dedicated high-volume serving, and an orchestration layer like SkyPilot or BentoML to manage cost optimization across both. The CNCF conformance program is reducing the switching cost, but lock-in still happens through proprietary serving APIs, custom quantization formats, and model weight distribution pipelines.
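The heuristic above can be written down as a small routing function, which makes the precedence of the rules explicit. This is a rough sketch, not a vetted policy: the `Workload` fields are hypothetical, the category names come from the taxonomy table, and splitting dedicated serving between inference platforms and neoclouds on latency sensitivity is an assumption drawn from that table's "Strength" column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload's placement constraints."""
    name: str
    compliance_bound: bool = False
    data_gravity: bool = False
    high_volume_serving: bool = False
    latency_sensitive: bool = False


def pick_category(workload: Workload) -> str:
    """Map a workload to a provider category using the article's heuristic."""
    # Compliance and data gravity take precedence over cost or latency.
    if workload.compliance_bound or workload.data_gravity:
        return "Traditional Hyperscalers"
    # Dedicated high-volume serving goes to a neocloud or inference platform.
    if workload.high_volume_serving:
        if workload.latency_sensitive:
            return "Inference Platforms"
        return "Neoclouds"
    # No hard constraint: let an orchestration layer place it on cost.
    return "Orchestration Layers"


workloads = [
    Workload("claims-scoring", compliance_bound=True, high_volume_serving=True),
    Workload("support-agent", high_volume_serving=True, latency_sensitive=True),
    Workload("nightly-embeddings", high_volume_serving=True),
    Workload("eval-sweeps"),
]

for w in workloads:
    print(f"{w.name}: {pick_category(w)}")
```

Checking compliance first reflects the ordering in the heuristic: a regulated workload stays on a hyperscaler even when it is also a high-volume serving job.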
The teams that will win this decade are the ones treating AI infrastructure as a multi-provider portfolio problem — not a single-vendor selection exercise.
References
- The New Stack — A practical guide to the 6 categories of AI cloud infrastructure in 2026
- Google Cloud Blog — AI infrastructure at Next ’26
- CNCF — Welcome llm-d: Evolving Kubernetes into SOTA AI infrastructure
- CNCF — Nearly doubles certified Kubernetes AI platforms
- Tech Fund — NeoCloud Economics: Nebius and CoreWeave
- LinkedIn — AI Infrastructure Shifts to Power, Cooling, and Execution Speed
- Interconnect — CoreWeave vs Nebius