K8s GPU Clusters Waste 95% of Capacity — Top Teams Don’t

Production Kubernetes GPU clusters across AWS, GCP, and Azure average just 5% GPU utilization, with CPU at 8% and memory at 20%. CPU overprovisioning jumped from 40% to 69% year over year. GPU prices are rising for the first time since 2006. The top-performing clusters sustain 49% GPU utilization, showing that the roughly 10x gap is technique, not hardware. Automated rightsizing, Spot GPU placement with region-aware selection, and shared GPU scheduling close the gap without sacrificing reliability.

K8s GPU Clusters Run at 5%

These figures come from Cast AI’s 2026 State of Kubernetes Optimization Report, which measured tens of thousands of production clusters before any optimization was applied. They are not estimates — they are direct measurements. The gap between provisioned and consumed compute is not shrinking. It is compounding.

Overprovisioning Is Structural, Not Accidental

CPU overprovisioning jumped from 40% to 69% year over year. Memory overprovisioning sits at 79%. The mechanics are predictable: engineering teams pad resource requests to avoid throttling and OOM kills, cluster autoscalers respond to those inflated requests by provisioning more nodes, and nobody circles back to revise the numbers after deployment. Helm charts embed conservative defaults that propagate across environments. The result is a self-reinforcing cycle where perceived demand drives real supply costs.

The counterintuitive finding from the report: generous overprovisioning does not prevent OOM kills. One cluster averaging 40–50 OOM kills per measurement interval under static padding dropped to near zero after automated rightsizing — while also cutting provisioned CPUs by roughly half. The rightsizing mechanism increased memory limits for workloads under genuine pressure, which static overprovisioning consistently missed. Efficiency and reliability are not a tradeoff here. Automated rightsizing delivers both.
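The two-directional behavior described above, lowering padded requests while raising limits that are under real pressure, can be sketched in a few lines. This is a minimal illustration, not the report's mechanism: the 95th-percentile target, the 20% headroom, and the sample values are all assumptions chosen for the example.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def recommend(usage_samples, pct=95, headroom=0.2):
    """Size a request or limit from observed usage plus headroom.

    pct and headroom are illustrative assumptions, not values from the report.
    """
    return round(percentile(usage_samples, pct) * (1 + headroom), 2)

def review(name, current, usage_samples):
    """Describe whether the current value should be lowered, raised, or kept."""
    target = recommend(usage_samples)
    if target < current:
        action = "lower"
    elif target > current:
        action = "raise"
    else:
        action = "keep"
    return f"{name}: {current} -> {target} ({action})"

# Hypothetical padded CPU request: 4 cores requested, about 1 core used.
cpu_usage = [0.6, 0.7, 0.8, 0.9, 1.0, 0.9, 0.8, 0.7, 1.1, 0.9]
# Hypothetical memory limit of 2.0 GiB with usage pressing against it.
mem_usage = [1.7, 1.8, 1.9, 1.95, 2.0, 1.9, 1.85, 1.9, 2.0, 1.95]

print(review("cpu cores", 4.0, cpu_usage))
print(review("memory GiB", 2.0, mem_usage))
```

The same rule shrinks the padded CPU request and grows the memory limit, which is why a static safety margin applied uniformly can miss the workloads that actually need more.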

GPU Economics Just Got Harder

An idle CPU core costs fractions of a cent per hour. An idle H200 costs several dollars. For the first time since EC2 launched in 2006, GPU prices are rising, not falling. AWS raised H200 Capacity Block prices by 15% in January 2026, breaking a two-decade pricing trend. At 5% average utilization, the cost-per-useful-FLOP math is brutal.
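The cost-per-useful-work arithmetic is simple division: the hourly price spread over the fraction of the hour that does real work. The sketch below uses the 5% and 49% utilization figures from the report; the hourly price is a placeholder, not a quoted rate.

```python
def cost_per_useful_hour(hourly_price, utilization):
    """Effective price of one fully used GPU-hour at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization

PRICE = 4.00  # hypothetical $/GPU-hour, for illustration only

for label, util in [("fleet average", 0.05), ("top cluster", 0.49)]:
    effective = cost_per_useful_hour(PRICE, util)
    print(f"{label}: ${effective:.2f} per useful GPU-hour")
```

At 5% utilization every useful GPU-hour costs 20 times the list price, so a 15% price increase is small next to the utilization multiplier.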

The scarcity feedback loop amplifies the problem. Teams hoard GPU capacity because they cannot reliably get it back. That hoarding drives prices higher, which makes hoarding seem rational. Spot adoption for GPU workloads remained below 2% through most of 2025 — partly because Spot GPU capacity simply did not exist for most instance types.

That is starting to change in 2026, but the picture is uneven. For T4 instances on AWS, survival probability above 0.9 holds for a full 24-hour window in eu-west-3. In eu-central-1 and us-east-1, the same instance type drops below 0.2 survival probability, which means a better than 80% chance of interruption within a day. Region selection for Spot GPU workloads is therefore a reliability decision, not just a cost optimization. The same instance in the right region is the difference between a training run that completes and one requiring constant checkpoint recovery. By selecting favorable regions in real time, teams see 2–5x cost differences on Spot pricing alone.

| Region | T4 Spot 24h Survival | Risk Level |
| --- | --- | --- |
| eu-west-3 | >90% | Low: suitable for long training runs |
| eu-central-1 | <20% | High: frequent interruption expected |
| us-east-1 | <20% | High: frequent interruption expected |
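One way to act on survival data like this is to filter Spot options by a survival floor first and compare prices only among the survivors, falling back to on-demand when nothing qualifies. The sketch below is a simplified illustration: the survival values use the bounds from the table as stand-ins, and every price and the `min_survival` threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpotOption:
    region: str
    survival_24h: float  # probability of running 24 hours uninterrupted
    hourly_price: float  # hypothetical $/hour

def choose_placement(options, min_survival, on_demand_price):
    """Pick the cheapest Spot option that meets the survival floor.

    Falls back to on-demand when no Spot option is reliable enough.
    """
    viable = [o for o in options if o.survival_24h >= min_survival]
    if not viable:
        return ("on-demand", on_demand_price)
    best = min(viable, key=lambda o: o.hourly_price)
    return (f"spot:{best.region}", best.hourly_price)

# Survival values approximate the table bounds; prices are made up.
OPTIONS = [
    SpotOption("eu-west-3", 0.90, 0.20),
    SpotOption("eu-central-1", 0.20, 0.15),
    SpotOption("us-east-1", 0.20, 0.12),
]

print(choose_placement(OPTIONS, min_survival=0.80, on_demand_price=0.50))
print(choose_placement(OPTIONS, min_survival=0.95, on_demand_price=0.50))
```

Applying the reliability filter before the price comparison keeps the cheapest but most interruption-prone region from winning by default.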

GPU Sharing: Known Solution, Minimal Adoption

The standard deployment model assigns each model a dedicated GPU instance. For most inference workloads, this is wasteful. Request rates are bursty with long idle periods between them. On dedicated instances, idle time costs the same as peak utilization.

One case study from the report: ALLEN Digital was running seven models on SageMaker — three open-source and four custom. GPU instances ran continuously serving intermittent load. After migrating to Kubernetes with GPU time-slicing, a 50/50 on-demand/Spot split, and node bin-packing, results broke down as follows:

| Optimization Step | Savings | Mechanism |
| --- | --- | --- |
| GPU time-slicing | 20% | Multiple models share same GPU with temporal isolation |
| Model consolidation | 30–40% | Bin-packing models onto shared instances |
| Full stack rightsizing | 70%+ total | CPU, memory, and GPU allocation against actual consumption |

Latency held throughout. The key enabler was not a new algorithm — it was putting multiple workloads on shared hardware with intelligent scheduling based on actual compute and memory needs, something Kubernetes makes straightforward but most teams never attempt.
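The consolidation step is a bin-packing problem. A common heuristic for it is first-fit decreasing, sketched below on GPU memory alone. The model names, memory footprints, and the 24 GiB GPU size are hypothetical; only the count of seven models comes from the case study, and a real scheduler would also weigh compute and latency.

```python
def pack_models(model_mem_gib, gpu_mem_gib):
    """First-fit decreasing: place the largest models first.

    Returns a list of GPUs, each a list of the model names placed on it.
    """
    gpus = []  # each entry: {"free": remaining GiB, "models": [names]}
    by_size = sorted(model_mem_gib.items(), key=lambda kv: kv[1], reverse=True)
    for name, need in by_size:
        if need > gpu_mem_gib:
            raise ValueError(f"{name} does not fit on a single GPU")
        for gpu in gpus:
            if gpu["free"] >= need:
                gpu["free"] -= need
                gpu["models"].append(name)
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append({"free": gpu_mem_gib - need, "models": [name]})
    return [gpu["models"] for gpu in gpus]

# Seven hypothetical models with made-up memory footprints in GiB.
MODELS = {
    "oss-a": 10, "oss-b": 8, "oss-c": 6,
    "custom-a": 12, "custom-b": 5, "custom-c": 4, "custom-d": 2,
}

placement = pack_models(MODELS, gpu_mem_gib=24)
print(f"{len(MODELS)} dedicated GPUs -> {len(placement)} shared GPUs")
for index, models in enumerate(placement):
    print(index, models)
```

First-fit decreasing is a heuristic rather than an optimal solver, but it is cheap enough to rerun whenever measured footprints change.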

Top Performers Close the 10x Gap

One cluster in the dataset — 136 H200s sustaining 49% GPU utilization — proves the ceiling is not theoretical. The fleet average is 5%. The gap is almost entirely technique, not hardware. The organizations closing that gap share three practices.

Continuous rightsizing. Not a one-time pass at deployment. Continuous monitoring and adjustment of resource requests against actual consumption. This applies to both CPU/memory limits and GPU allocation.

Automated Spot placement. Not manual decisions about which region or pool to use. Automated selection across instance pools, availability zones, and regions with fallback to on-demand when availability drops. No team can monitor Spot survival curves in real time manually.

Shared GPU scheduling. Not one model per instance. Shared instances with intelligent scheduling that places multiple workloads based on actual compute and memory profiles. MIG partitioning provides hardware-level isolation between workloads; time-slicing shares a GPU in time but does not isolate memory or faults between them.

DRA Changes the Scheduling Foundation

At KubeCon Europe 2026, NVIDIA donated its DRA GPU driver to CNCF and Google released an open-source DRA TPU driver. Dynamic Resource Allocation replaces the old Device Plugin model, in which each node advertised GPUs as an opaque integer count and the scheduler could only match requests against that number. Under DRA, workloads describe what they need (VRAM, architecture, interconnect topology) and the control plane figures out how to satisfy that claim across the cluster. According to DoiT’s engineering team, Device Plugins provided the scheduler with no useful information about hardware attributes, forcing administrators into manual node-label mapping that does not scale.

DRA introduces DeviceClass abstractions — admins define named classes like high-memory-gpu or low-latency-inference, and developers request by name. ResourceSlice objects advertise available hardware in real time. For mixed GPU/TPU clusters, this means a single scheduling path replaces vendor-specific workarounds.
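The shift from counting GPUs to matching attributes can be illustrated with a toy model. This is not the Kubernetes DRA API: the `Device` and `Claim` classes and the `allocate` function are hypothetical stand-ins for what ResourceSlice advertisements and claims express, and the smallest-fit rule is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Device:
    """Toy stand-in for one advertised device and its attributes."""
    node: str
    name: str
    vram_gib: int
    architecture: str

@dataclass(frozen=True)
class Claim:
    """Toy stand-in for a workload's stated requirements."""
    min_vram_gib: int
    architecture: Optional[str] = None  # None accepts any architecture

def allocate(claim, devices, taken):
    """Return the smallest free device that satisfies the claim, or None."""
    candidates = [
        d for d in devices
        if (d.node, d.name) not in taken
        and d.vram_gib >= claim.min_vram_gib
        and (claim.architecture is None or d.architecture == claim.architecture)
    ]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda d: d.vram_gib)
    taken.add((chosen.node, chosen.name))
    return chosen

DEVICES = [
    Device("node-a", "gpu-0", 16, "turing"),   # T4-class card
    Device("node-b", "gpu-0", 141, "hopper"),  # H200-class card
]

taken = set()
print(allocate(Claim(min_vram_gib=80, architecture="hopper"), DEVICES, taken))
print(allocate(Claim(min_vram_gib=8), DEVICES, taken))
print(allocate(Claim(min_vram_gib=80), DEVICES, taken))  # nothing left that fits
```

With an integer count, both devices would look identical to the scheduler. With attributes, the small claim lands on the small card and the large card stays free for the workload that needs it.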

The production stack for distributed AI training, as described in CloudOptimo’s Kubernetes AI infrastructure guide, layers Kueue for quota and admission control with Volcano for gang scheduling. When a developer submits a RayJob manifest, Kueue holds it in suspension if quota is exhausted while infrastructure scales in the background. Volcano monitors the scheduling pool continuously, committing workers to nodes only when the complete set can be satisfied simultaneously. Version drift between the Ray Operator and Kueue’s admission webhook is a common source of jobs that appear queued but never schedule — component version compatibility has become a first-class operational concern.
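The two gates described above, a quota check that suspends the job and an all-or-nothing placement check, can be modeled in a few lines. This is a toy sketch of the idea only. It does not use the Kueue or Volcano APIs, and the function name, status strings, and node capacities are hypothetical.

```python
def admit_gang(workers_needed, gpus_per_worker, free_gpus_by_node, quota_remaining):
    """Decide whether a gang of identical workers can start.

    Returns (status, plan), where plan maps node name to worker count.
    The plan is empty unless every worker can be placed at once.
    """
    total_gpus = workers_needed * gpus_per_worker
    if total_gpus > quota_remaining:
        # Quota gate: hold the whole job instead of starting part of it.
        return ("suspended", {})

    plan = {}
    remaining = workers_needed
    for node, free in free_gpus_by_node.items():
        fit = min(remaining, free // gpus_per_worker)
        if fit > 0:
            plan[node] = fit
            remaining -= fit
        if remaining == 0:
            break

    if remaining > 0:
        # Gang gate: a partial placement is discarded, nothing is committed.
        return ("pending", {})
    return ("scheduled", plan)

print(admit_gang(4, 2, {"n1": 4, "n2": 4}, quota_remaining=8))
print(admit_gang(4, 2, {"n1": 4, "n2": 2}, quota_remaining=8))
print(admit_gang(4, 2, {"n1": 4, "n2": 4}, quota_remaining=6))
```

Discarding partial placements is what keeps a distributed job from holding GPUs while it waits for workers that may never arrive.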

For teams already running GPU schedulers, the gap between scheduler-level allocation and actual compute consumption can be substantial — a topic we covered in detail in our analysis of GPU scheduler waste patterns.

Cold Starts Compound the Waste Problem

Serverless GPU platforms promise scale-to-zero billing. The cost is a 40–90 second penalty on first request when the container image is already cached, as detailed in Spheron’s analysis of GPU cold starts. For a 70B model, the breakdown is roughly: container pull 4–8 minutes (uncached), weight load 40–45 seconds, CUDA context and graph capture 30 seconds, KV cache warmup 10 seconds. Even with cached images, the remaining stages add up to roughly 80–85 seconds for large models, and an uncached pull adds minutes on top of that.
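Summing the stage figures quoted above makes the cached and uncached cases easy to compare. The numbers below are the ones from the breakdown; treating the stages as strictly sequential is an assumption of this sketch.

```python
# Stage durations in seconds as (low, high), taken from the breakdown above.
WARM_STAGES = {
    "weight load": (40, 45),
    "CUDA context and graph capture": (30, 30),
    "KV cache warmup": (10, 10),
}
IMAGE_PULL_UNCACHED = (4 * 60, 8 * 60)  # 4 to 8 minutes

def total_range(ranges):
    """Sum (low, high) pairs, assuming the stages run one after another."""
    low = sum(r[0] for r in ranges)
    high = sum(r[1] for r in ranges)
    return (low, high)

cached = total_range(WARM_STAGES.values())
uncached = total_range(list(WARM_STAGES.values()) + [IMAGE_PULL_UNCACHED])

print(f"cached image:   {cached[0]}-{cached[1]} s")
print(f"uncached image: {uncached[0]}-{uncached[1]} s")
```

The image pull dominates the uncached case, so image caching gives the largest single reduction before any tuning of weight loading or graph capture.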

This is directly relevant to the utilization problem. If scale-to-zero is your GPU sharing strategy, every cold start imposes a user-facing latency spike. Teams that maintain warm replicas avoid this but pay for idle capacity — the same capacity the utilization report identifies as wasted. The resolution is not to pick one extreme or the other but to combine persistent NVMe caches for weight storage, CUDA graph snapshots, and minimum replica configurations calibrated to actual traffic patterns rather than padded estimates. For inference-specific optimization on GPU memory, our breakdown of MoE inference memory costs covers the model-side factors that compound infrastructure waste.

Waste Does Not Correct Itself

The data from 2026 is consistent with the pattern from prior years: Kubernetes adoption scales, efficiency declines proportionally, and the gap between paid-for and consumed compute widens. The overprovisioning problem is not a knowledge problem — the techniques are well understood. It is a process problem. No team owns the feedback loop between what was requested at deployment and what is actually consumed six months later.

For AI workloads specifically, the economics are accelerating the urgency. GPU prices are rising for the first time in two decades. The organizations that closed the 10x utilization gap did it through continuous automation: not hero efforts during outages, but systematic rightsizing, intelligent Spot placement, and shared GPU scheduling running as platform-level services. The 49% utilization cluster should not be read as an unreachable outlier. It is what the average should look like.

References