AI Cloud in 2026: Six Categories Platform Teams Must Know

Inference Is Two-Thirds of AI Compute

Deloitte’s 2026 TMT Predictions estimates inference accounts for roughly two-thirds of all AI compute this year, a structural shift that has fractured the cloud market into six distinct provider categories. NVIDIA’s Blackwell and GB200 architectures have flooded the market with new GPU options, and the question “which cloud for my AI workload” no longer has a three-hyperscaler answer. Platform teams now face dozens of sub-decisions across training, fine-tuning, real-time serving, and agentic orchestration.

The Six AI Cloud Categories

A working taxonomy has emerged from the fragmentation. Each category optimizes for a different workload profile, and most production teams end up spanning at least two or three.

Category	Key Providers	Strength	Weakness	Best For
Traditional Hyperscalers	AWS, Azure, GCP, Oracle	Ecosystem depth, compliance	Higher per-GPU cost, slow provisioning	Regulated enterprise, hybrid
Neoclouds	CoreWeave, Lambda, Nebius, Crusoe	Bare-metal GPU perf, K8s-native	Limited non-GPU services	Frontier training, large-scale inference
Developer Clouds	DigitalOcean, Vultr, Hyperstack	Transparent pricing, fast onboarding	Smaller GPU fleets	Prototyping, mid-market
Inference Platforms	Fireworks, Groq, Cerebras, Together	Ultra-low latency, purpose-built serving	No training, model constraints	Real-time agents, chatbots
GPU Marketplaces	Vast.ai, TensorDock, RunPod	Lowest hourly prices, broad hardware	Variable reliability, weak SLAs	Experimentation, batch work
Orchestration Layers	BentoML, SkyPilot, Anyscale	Provider-agnostic, portable	Abstraction overhead	Multi-cloud, workload migration

Neoclouds Are the Breakout Category

The numbers are large. CoreWeave went public in March 2025 and carries a contracted revenue backlog of $66.8 billion as of December 2025, up from $55.6 billion one quarter earlier, an increase of about 20%. NVIDIA invested $2 billion in January 2026 to accelerate CoreWeave’s buildout toward 5 GW of AI factories by 2030. Lambda Labs closed a $1.5 billion round in late 2025 and is deploying GB300 NVL72 systems into Azure through a Microsoft partnership.

Nebius signed a $3 billion deal with Meta in February 2026 and targets 800 MW to 1 GW of connected capacity by year-end. Its contracted backlog is growing faster than it can provision hardware: CEO Arkady Volozh noted that demand for Blackwell capacity exceeded supply and that deal sizes were limited only by available infrastructure.

SemiAnalysis ClusterMAX 2.0 ratings put Nebius, Oracle, Azure, and CoreWeave in the top gold tier for infrastructure quality, a validation that neoclouds now compete directly with hyperscalers on reliability — not just price.

Oracle and the Stargate Factor

Oracle’s positioning deserves special attention because it breaks the assumption that only NVIDIA-native clouds win at scale. The Stargate Initiative, a joint venture with OpenAI and SoftBank announced in January 2025, plans nearly 7 gigawatts of AI data center capacity across multiple US sites. The flagship campus in Abilene, Texas is already operational, deploying up to 450,000 GB200 superchips.

OCI Superclusters scale to 131,072 NVIDIA GPUs per cluster as of mid-2025, underpinned by the Zettascale10 architecture. Oracle’s strategy is essentially “become the physical layer for frontier model builders” — and OpenAI is both the anchor tenant and co-investor. AWS is expected to approach $200 billion in 2026 CapEx, but Oracle’s focused GPU density per cluster is hard to match.

Google’s Silicon Gambit: TPU v8

At Google Cloud Next ’26 in April, Google introduced two distinct TPU v8 chips optimized for different workloads. The TPU 8t packs 9,600 chips in a single superpod delivering 121 exaflops and 2 petabytes of shared memory, designed for high-throughput training. The TPU 8i triples on-chip SRAM to 384 MB and increases HBM to 288 GB to host massive KV caches entirely on silicon — a direct answer to the inference bottleneck.

The 8i also doubles ICI inter-chip bandwidth to 19.2 Tb/s and includes a Collectives Acceleration Engine that cuts on-chip latency by up to 5x. Google claims 80% better performance per dollar for inference compared to the prior generation. For teams already running on GKE, the TPU 8i is positioned as the cost-optimal path for high-volume Gemini or open-model inference without vendor lock-in to NVIDIA’s pricing cycle.
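A rough back-of-the-envelope sketch shows why HBM capacity bounds how much KV cache a chip can hold. The model dimensions used here (80 layers, 8 KV heads, head dimension 128, 16-bit values, 70 billion parameters) are hypothetical and chosen only for illustration. The 288 GB figure comes from the text and is assumed to mean decimal gigabytes, with all memory left after the weights available for cache.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: one key and one value
    vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(hbm_bytes: int, weight_bytes: int,
                      per_token_bytes: int) -> int:
    """Tokens whose KV entries fit in the HBM left over after model weights."""
    free_bytes = max(hbm_bytes - weight_bytes, 0)
    return free_bytes // per_token_bytes


GB = 10**9                                      # assumption: decimal gigabytes
per_token = kv_bytes_per_token(80, 8, 128, 2)   # hypothetical model shape
weights = 70 * 10**9 * 2                        # hypothetical 70B params at 2 bytes each
tokens = max_cached_tokens(288 * GB, weights, per_token)
print(f"{per_token} bytes per token, room for {tokens} cached tokens")
```

Under these assumptions each token costs 327,680 bytes of cache, so capacity is shared across every concurrent sequence; real deployments also reserve memory for activations and runtime overhead, which this sketch ignores.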

Google also announced A5X bare-metal instances powered by NVIDIA’s upcoming Vera Rubin NVL72 platform, and it is co-engineering the open-source Falcon networking protocol with NVIDIA through the Open Compute Project.

CNCF Conformance and llm-d

The fragmentation problem has a standards answer. The CNCF launched its Kubernetes AI Conformance Program in November 2025, and by KubeCon EU in March 2026 the number of certified platforms had grown from 18 to 31, roughly a 1.7x increase. The program now validates agentic workloads and mandates alignment with Kubernetes v1.35 technical primitives.

The more significant development is llm-d, accepted as a CNCF Sandbox project in March 2026. Founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA, llm-d treats distributed LLM inference as a first-class cloud-native workload with the explicit goal: any model, any accelerator, any cloud. It introduces a DisaggregatedSet operator that separates model weights, KV cache, and compute into independently scalable Kubernetes resources.

Google is aligning GKE’s inference gateway with llm-d’s disaggregated serving model. If llm-d matures, it becomes the abstraction layer that makes the six-category taxonomy manageable: you write your serving logic once and target any certified platform. For teams already scheduling AI workloads on Kubernetes with Dynamic Resource Allocation (DRA), this is the next layer up.
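The idea of scaling weights, KV cache, and compute independently can be illustrated with a toy model. This is a conceptual sketch only: the `Pool` class, the `scale` function, and the demand units are hypothetical names invented for illustration and are not part of llm-d or any Kubernetes API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """One independently scaled resource pool (hypothetical, not an llm-d type)."""
    name: str
    demand: float                # current load, in the pool's own unit
    capacity_per_replica: float  # load one replica can absorb

    def replicas(self) -> int:
        # Keep at least one replica up; otherwise round demand up to whole replicas.
        return max(1, math.ceil(self.demand / self.capacity_per_replica))


def scale(pools: list[Pool]) -> dict[str, int]:
    """Size each pool from its own demand signal, ignoring the others."""
    return {p.name: p.replicas() for p in pools}


# Illustrative numbers only: models served, GB of cache, tokens per second.
pools = [
    Pool("weights", demand=3, capacity_per_replica=4),
    Pool("kv_cache", demand=900, capacity_per_replica=288),
    Pool("compute", demand=4200, capacity_per_replica=1000),
]
print(scale(pools))
```

The point of the separation is that a spike in long-context traffic can grow the cache pool without also paying for extra compute or extra copies of the weights.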

What This Means for Platform Teams

Three concrete implications for engineers making infrastructure decisions in the second half of 2026:

1. Stop treating inference as a subset of training infrastructure. The workload profiles are fundamentally different. Inference needs always-on, low-latency, auto-scaling serving, while training runs as batch-oriented, checkpoint-restart jobs. The inference-optimized platforms (Groq, Cerebras, Fireworks) exist because hyperscaler GPU instances are often overprovisioned for serving.
2. Evaluate neoclouds for dedicated capacity, not just spot pricing. The quality gap has closed. With SemiAnalysis ratings putting neoclouds in the gold tier alongside Oracle and Azure, the decision now turns on contract structure, data gravity, and compliance rather than reliability. CoreWeave’s $66.8B backlog suggests enterprise buyers have already made this shift.
3. Build for portability with llm-d as the target abstraction. The CNCF conformance program plus llm-d’s disaggregated serving model means Kubernetes-native AI deployment is becoming a reality. Teams that invest in deployment speed and portability today are more likely to avoid the vendor lock-in pain that early GPU adopters experienced during the Hopper-to-Blackwell transition.
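To make the first implication concrete, here is a minimal sizing sketch for an always-on serving tier, using Little's law (requests in flight = arrival rate × average latency). The function name, the 20% headroom default, and the traffic figures are all assumptions for illustration, not measurements from any provider.

```python
import math


def required_replicas(requests_per_s: float, avg_latency_s: float,
                      concurrency_per_replica: int,
                      headroom: float = 0.2) -> int:
    """Replicas needed to hold steady-state in-flight requests plus headroom."""
    if concurrency_per_replica <= 0:
        raise ValueError("concurrency_per_replica must be positive")
    # Little's law: average requests in flight = arrival rate * average latency.
    in_flight = requests_per_s * avg_latency_s
    needed = in_flight * (1 + headroom) / concurrency_per_replica
    return max(1, math.ceil(needed))  # serving stays always-on: never below 1


# Hypothetical daily pattern: quiet, peak, shoulder traffic.
for rps in (20, 200, 50):
    print(rps, "req/s ->", required_replicas(rps, 1.5, 16), "replicas")
```

The replica count tracks request rate continuously, which is the opposite of a training job that holds a fixed allocation until its next checkpoint.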

The Decision Framework

The practical heuristic most teams are converging on: use a hyperscaler for data gravity and compliance-bound workloads, a neocloud or inference platform for dedicated high-volume serving, and an orchestration layer like SkyPilot or BentoML to manage cost optimization across both. The CNCF conformance program is reducing the switching cost, but lock-in still happens through proprietary serving APIs, custom quantization formats, and model weight distribution pipelines.
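The heuristic above can be written down as a small placement function. This is a sketch, not a definitive policy: the `Workload` fields, the category labels, and the fallback for workloads the heuristic does not mention are assumptions, and the training check borrows the "no training" weakness listed for inference platforms in the taxonomy table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload descriptor; field names are illustrative."""
    name: str
    compliance_bound: bool = False
    data_gravity: bool = False
    high_volume_serving: bool = False
    needs_training: bool = False


def pick_category(w: Workload) -> str:
    """Map one workload to a provider category using the article's heuristic."""
    if w.compliance_bound or w.data_gravity:
        return "hyperscaler"
    if w.high_volume_serving:
        # Inference platforms do not offer training, so mixed workloads go to a neocloud.
        return "neocloud" if w.needs_training else "inference platform"
    # Assumption: anything else is exploratory work that tolerates weaker SLAs.
    return "developer cloud or GPU marketplace"


def plan(workloads: list[Workload]) -> tuple[dict[str, str], bool]:
    """Return per-workload placement and whether an orchestration layer is warranted."""
    placement = {w.name: pick_category(w) for w in workloads}
    spans_categories = len(set(placement.values())) > 1
    return placement, spans_categories


placement, needs_orchestration = plan([
    Workload("records-rag", compliance_bound=True),
    Workload("support-chat", high_volume_serving=True),
])
print(placement, needs_orchestration)
```

Once a portfolio spans more than one category, the sketch flags it for an orchestration layer such as SkyPilot or BentoML, matching the pattern described above.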

The teams that will win this decade are the ones treating AI infrastructure as a multi-provider portfolio problem — not a single-vendor selection exercise.
