DRA Killed the GPU Device Plugin: K8s AI Scheduling in 2026

NVIDIA’s DRA Donation Ends GPU Blindness

At KubeCon Europe 2026 in Amsterdam, NVIDIA killed the GPU device plugin model by donating its Dynamic Resource Allocation (DRA) driver for GPUs to the Cloud Native Computing Foundation. That single act starts the retirement of the device plugin that has made Kubernetes treat your H100 identically to a T4 since 2017. The old model exposes GPUs as opaque integers (nvidia.com/gpu: 1) with zero visibility into memory, compute capability, NVLink topology, or MIG configuration. If you run GPU workloads on Kubernetes and you are still on the device plugin, you are losing 20–30% of your accelerator capacity to fragmentation.

This is not a theoretical upgrade. DRA, combined with the KAI Scheduler and a restructured NVIDIA GPU stack, fundamentally changes how platform teams allocate, schedule, and share GPUs across multi-tenant AI clusters. Here is what the convergence looks like and what it means for production infrastructure.

The Device Plugin Model Is Dead

The Kubernetes device plugin API shipped in 2017. It solved a real problem at the time: exposing accelerators to the kubelet so containers could access them. But it was designed for a world where one pod consumed one entire GPU. That world is gone.

Modern AI inference clusters run dozens of models per node. Ten services each using 15% of a GPU add up to 1.5 GPUs of actual work, yet they still consume ten physical devices under the device plugin model, a waste tax that compounds at scale. The plugin cannot express “I need a GPU with 80 GB memory, Ampere architecture or newer, and NVLink connectivity to at least three peers.” You have to approximate this with manual node labels, taints, and custom scheduling logic.
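The waste arithmetic can be sketched in a few lines. This is a back-of-the-envelope model, not a measurement: the function name is hypothetical, and it assumes identical services that could be packed perfectly onto shared devices.

```python
import math


def gpus_needed(num_services: int, fraction_each: float) -> tuple[int, int]:
    """Return (GPUs under whole-device allocation, GPUs under ideal packing)."""
    # Device plugin model: every pod requests an integer GPU, so one device each.
    whole_device = num_services
    # Ideal packing (assumption): a GPU hosts as many services as fit completely.
    services_per_gpu = math.floor(1.0 / fraction_each + 1e-9)
    packed = math.ceil(num_services / services_per_gpu)
    return whole_device, packed


whole, packed = gpus_needed(10, 0.15)
print(whole, packed)  # 10 2
```

Six services at 15% fit on one device, so ten services need two devices instead of ten under this idealized model.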

The consequences are measurable. The CNCF annual survey from January 2026 reports that 66% of organizations hosting generative AI models now use Kubernetes for some or all inference workloads. Most of them are running the device plugin. The gap between what the scheduler can reason about and what the hardware actually provides has become an operational bottleneck.

DRA Replaces Integers With Claims

Dynamic Resource Allocation introduces three core API objects: ResourceClaim (a request for specific devices, bound to concrete hardware at scheduling time), ResourceClaimTemplate (generates a claim per pod), and DeviceClass (a named category of devices, defined by selectors and configuration, that claims reference). The evolution has been deliberate: alpha in Kubernetes 1.26, redesigned around structured parameters in 1.31, beta in 1.32, v1beta2 in 1.33, and a stable resource.k8s.io/v1 API in 1.34.

Instead of requesting nvidia.com/gpu: 1, you write a CEL expression that the scheduler evaluates against structured device attributes:

  • Memory in bytes — request exactly what your model needs
  • Compute capability — filter by architecture (Ampere, Hopper, Blackwell)
  • MIG profile availability — select partitioned or full-device configurations
  • NVLink peer topology — co-locate pods that need high-bandwidth interconnect
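To illustrate what attribute-based selection buys you, here is a minimal Python model of the matching step. It is not the DRA API and not CEL: the Device fields, the ARCH_ORDER list, and the inventory are all hypothetical stand-ins for whatever attributes a driver actually publishes, and memory is expressed in GiB for simplicity.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Device:
    # Hypothetical attribute set; real drivers define their own attribute names.
    name: str
    memory_bytes: int
    architecture: str
    nvlink_peers: int


# Assumed ordering, used only to evaluate "this architecture or newer".
ARCH_ORDER = ["Turing", "Ampere", "Hopper", "Blackwell"]


def matches(dev: Device, min_memory_bytes: int, min_arch: str, min_nvlink_peers: int) -> bool:
    """Return True if the device satisfies every constraint in the request."""
    return (
        dev.memory_bytes >= min_memory_bytes
        and ARCH_ORDER.index(dev.architecture) >= ARCH_ORDER.index(min_arch)
        and dev.nvlink_peers >= min_nvlink_peers
    )


inventory = [
    Device("t4-0", 16 * GIB, "Turing", 0),
    Device("a100-0", 80 * GIB, "Ampere", 3),
    Device("h100-0", 80 * GIB, "Hopper", 7),
    Device("a100-1", 40 * GIB, "Ampere", 3),
]

# "80 GiB, Ampere or newer, NVLink to at least three peers"
eligible = [d.name for d in inventory if matches(d, 80 * GIB, "Ampere", 3)]
print(eligible)  # ['a100-0', 'h100-0']
```

Under the device plugin, all four devices above would be interchangeable units of nvidia.com/gpu; with structured attributes, the request itself excludes the T4 and the 40 GiB A100.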

The NVIDIA DRA driver discovers GPU hardware on each node, publishes these structured parameters to the Kubernetes API as ResourceSlice objects, and prepares the specific GPU instances that the scheduler allocates to each claim. AMD has introduced its own DRA driver for Instinct GPUs, confirming that DRA is becoming the cross-vendor standard rather than an NVIDIA-specific path.

Critically, DRA and the legacy device plugin can coexist during migration. They use different request mechanisms. You can migrate workloads incrementally rather than scheduling a Big Bang migration across your fleet.

KAI Scheduler: The Orchestration Layer

DRA solves resource description. It does not solve scheduling policy. That is where the Kubernetes AI Infrastructure (KAI) Scheduler comes in — an NVIDIA open-source secondary scheduler that runs alongside kube-scheduler.

The default Kubernetes scheduler evaluates pods independently. For distributed training with eight workers, if seven schedule and one stays pending, the seven active workers hold GPU allocations at near-zero utilization while blocking capacity for everything else. Gang scheduling — all-or-nothing resource allocation — is now table stakes for any serious AI workload on Kubernetes.
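The all-or-nothing rule is easy to state in code. The following is a minimal sketch of the idea only, not KAI Scheduler's implementation; the function name, the node names, and the first-fit placement are assumptions made for illustration.

```python
from typing import Optional


def gang_schedule(
    free_gpus_per_node: dict[str, int], workers: int, gpus_per_worker: int
) -> Optional[dict[str, str]]:
    """Return a worker-to-node placement for the whole gang, or None if it does not fit."""
    remaining = dict(free_gpus_per_node)  # work on a copy; commit nothing until all fit
    placement: dict[str, str] = {}
    for index in range(workers):
        node = next((n for n, free in remaining.items() if free >= gpus_per_worker), None)
        if node is None:
            return None  # one worker cannot be placed, so no worker is placed
        remaining[node] -= gpus_per_worker
        placement[f"worker-{index}"] = node
    return placement


# Seven free GPUs, eight workers: the job waits instead of holding seven idle GPUs.
print(gang_schedule({"node-a": 4, "node-b": 3}, workers=8, gpus_per_worker=1))  # None
# Eight free GPUs: every worker is placed in one decision.
print(len(gang_schedule({"node-a": 4, "node-b": 4}, workers=8, gpus_per_worker=1)))  # 8
```

The key property is that the capacity map is only a scratch copy until every worker has a slot, which is exactly what per-pod scheduling cannot guarantee.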

KAI Scheduler adds:

  • Gang scheduling — all pods in a job start together or none start
  • Fair-share queuing across teams with guaranteed quotas and burst limits
  • Bin-packing to reduce fragmentation
  • Priority-based preemption with eviction handling (production inference preempts training jobs)
  • Spot-aware scheduling with automatic rescheduling on spot reclamation

You define queues with guaranteed GPU allocations and burst ceilings. A training team might get 8 GPUs guaranteed, burst to 16, with priority 50. An inference team gets 16 GPUs guaranteed, burst to 32, with priority 100. When the inference team needs capacity, training jobs get preempted cleanly. This replaces the ad hoc PriorityClass and ResourceQuota juggling that most platform teams have been maintaining manually — a pattern familiar to anyone who has worked with Kubernetes admission controllers for policy enforcement.
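The queue example above can be modeled as follows. This is a simplified sketch under stated assumptions, not KAI Scheduler's algorithm: the Queue class and admit function are hypothetical, the cluster is assumed to have 32 GPUs, and preemption is assumed to reclaim only what a lower-priority queue holds above its guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    name: str
    guaranteed: int  # GPUs this queue can always keep
    burst: int       # ceiling on total GPUs in use
    priority: int
    in_use: int = 0


def admit(queue: Queue, request: int, queues: list[Queue], cluster_gpus: int) -> Optional[dict[str, int]]:
    """Admit `request` GPUs; return GPUs preempted per queue, or None if rejected."""
    if queue.in_use + request > queue.burst:
        return None  # request would exceed the burst ceiling
    free = cluster_gpus - sum(q.in_use for q in queues)
    shortfall = max(0, request - free)
    plan = []
    # Reclaim from lower-priority queues, lowest priority first,
    # taking only usage above each victim's guarantee (assumption).
    for victim in sorted((q for q in queues if q.priority < queue.priority), key=lambda q: q.priority):
        if shortfall == 0:
            break
        take = min(shortfall, max(0, victim.in_use - victim.guaranteed))
        if take:
            plan.append((victim, take))
            shortfall -= take
    if shortfall > 0:
        return None  # not enough reclaimable capacity; change nothing
    for victim, take in plan:
        victim.in_use -= take
    queue.in_use += request
    return {victim.name: take for victim, take in plan}


training = Queue("training", guaranteed=8, burst=16, priority=50, in_use=16)
inference = Queue("inference", guaranteed=16, burst=32, priority=100, in_use=12)
result = admit(inference, 8, [training, inference], cluster_gpus=32)
print(result, training.in_use, inference.in_use)  # {'training': 4} 12 20
```

Training keeps more than its guaranteed 8 GPUs here because only 4 had to be reclaimed; the burst capacity it borrowed is what pays for the inference request.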

The Converged 2026 Kubernetes AI Stack

The tooling has consolidated around a specific set of CNCF and vendor-backed components. The 2026 AI/ML on Kubernetes stack looks like this:

| Layer | Tool | Purpose |
| --- | --- | --- |
| GPU Management | NVIDIA GPU Operator + DRA Driver | Driver lifecycle, structured resource claims |
| Scheduling | KAI Scheduler + Kueue | Gang scheduling, fair-share, quota enforcement |
| Distributed Training | KubeRay / Ray Operator | Head/worker topology, RayCluster/RayJob CRDs |
| Inference Serving | vLLM + KServe | PagedAttention, continuous batching, autoscaling |
| Large Model Serving | LeaderWorkerSet + llm-d | Multi-host inference for 400B+ parameter models |
| Pipeline Orchestration | Argo Workflows / Kubeflow Pipelines | DAG-based training and deployment workflows |

Kueue has emerged as the community standard for batch workload management, handling quota management and fair-share scheduling. Volcano, a CNCF batch scheduler that predates both Kueue and KAI, pioneered gang scheduling. Where Kueue and Volcano are deployed together, they layer: Kueue enforces quota and holds jobs that exceed it; Volcano manages the gang scheduling once Kueue releases a job. One critical production caveat: submitting jobs directly to Volcano bypasses quota enforcement entirely, which causes resource contention that is painful to debug.

GPU Sharing Without the Waste Tax

Under the device plugin, GPU sharing meant choosing among time-slicing, MPS, and MIG, each with tradeoffs in isolation, throughput, and configuration complexity. Time-slicing is simplest but provides no memory isolation. MPS (Multi-Process Service) runs kernels from multiple processes concurrently but provides no fault isolation. MIG (Multi-Instance GPU) provides hardware-enforced isolation, but only on MIG-capable data center GPUs (A100 and newer) and with rigid partition sizes.

DRA changes this by making GPU partitioning a first-class scheduling operation. Instead of pre-configuring MIG profiles on nodes and hoping workloads fit, the DRA driver can configure MIG partitions dynamically at allocation time. The ResourceClaim specifies what the workload needs; the driver figures out how to carve the hardware. This shifts the configuration burden from node-level static setup to workload-level declarative requests.
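As a rough illustration of "the driver figures out how to carve the hardware", the sketch below picks the smallest MIG profile that satisfies a request. It is not the NVIDIA driver's logic: the pick_profile function is hypothetical, and the profile list is an illustrative subset of the A100 80GB profiles.

```python
from typing import Optional

# (profile name, compute slices, memory in GB), ordered smallest to largest.
# Illustrative subset of A100 80GB MIG profiles.
PROFILES = [
    ("1g.10gb", 1, 10),
    ("2g.20gb", 2, 20),
    ("3g.40gb", 3, 40),
    ("4g.40gb", 4, 40),
    ("7g.80gb", 7, 80),
]


def pick_profile(min_memory_gb: int, min_slices: int = 1) -> Optional[str]:
    """Return the smallest profile meeting both constraints, or None if none does."""
    for name, slices, memory_gb in PROFILES:
        if memory_gb >= min_memory_gb and slices >= min_slices:
            return name
    return None


print(pick_profile(18))      # 2g.20gb
print(pick_profile(30))      # 3g.40gb
print(pick_profile(30, 4))   # 4g.40gb
print(pick_profile(100))     # None
```

The point of the design is that the workload states a need (18 GB of memory) and never names a profile; the mapping from need to partition is made at allocation time.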

For inference workloads specifically, the impact is significant. A cluster running 40 models with varying GPU requirements can now pack workloads based on actual memory and compute needs rather than dedicating an entire device to each fractional workload. Combined with Karpenter for right-sizing node provisioning and Knative scale-to-zero for idle endpoints, the cost reduction potential is substantial, particularly when inference costs are already under hardware-level pressure.
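Packing by actual memory need is a classic bin-packing problem. The sketch below uses first-fit decreasing, a common heuristic chosen here for illustration; it is not claimed to be what any Kubernetes scheduler does, and the model names and sizes are made up.

```python
def pack_models(model_mem_gb: dict[str, int], gpu_mem_gb: int) -> list[list[str]]:
    """Assign models to GPUs with first-fit decreasing; return model names per GPU."""
    gpus: list[list] = []  # each entry is [free_memory_gb, [model names]]
    # Place the largest models first so small ones fill the leftover gaps.
    for name, need in sorted(model_mem_gb.items(), key=lambda kv: kv[1], reverse=True):
        if need > gpu_mem_gb:
            raise ValueError(f"{name} needs {need} GB, more than one GPU provides")
        for gpu in gpus:
            if gpu[0] >= need:
                gpu[0] -= need
                gpu[1].append(name)
                break
        else:
            gpus.append([gpu_mem_gb - need, [name]])  # no existing GPU fits; open a new one
    return [names for _, names in gpus]


models = {"a": 40, "b": 30, "c": 20, "d": 10, "e": 10, "f": 50}
layout = pack_models(models, gpu_mem_gb=80)
print(len(layout), layout)  # 2 [['f', 'b'], ['a', 'c', 'd', 'e']]
```

Six models that would each occupy a whole 80 GB device under integer allocation fit on two devices here. Memory is the only dimension modeled; a real placement would also have to account for compute and isolation requirements.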

What This Means for Platform Teams

The migration path is incremental. DRA and the device plugin coexist. But the strategic direction is clear: the device plugin model is a dead end. NVIDIA’s CNCF donation signals that the ecosystem is investing in DRA as the long-term standard.

If you are building or operating AI infrastructure on Kubernetes, the pragmatic next steps are:

  1. Upgrade to Kubernetes 1.33+: DRA v1beta2 is the minimum viable version, and 1.34 promotes the API to stable v1. Earlier betas had significant API churn.
  2. Install the NVIDIA DRA driver alongside your existing GPU Operator, then verify structured device discovery with kubectl get deviceclasses and kubectl get resourceslices.
  3. Evaluate KAI Scheduler for multi-tenant clusters — especially if you mix training and inference workloads with different priority profiles.
  4. Migrate one workload type at a time — start with inference workloads that benefit most from fractional GPU allocation, then move training jobs that need gang scheduling.
  5. Monitor queue depth and GPU utilization by namespace — the borrowing model in Kueue can cause workloads to grow into borrowed capacity and fail when reclaimed mid-run.
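For the monitoring step, one useful signal is how many GPUs each namespace holds beyond its own quota, since that is the capacity at risk of being reclaimed mid-run. The sketch below is a hypothetical check, not a Kueue API call: the function name and both input dictionaries are assumptions, and in practice the numbers would come from your metrics pipeline.

```python
def borrowed_gpus(usage: dict[str, int], nominal_quota: dict[str, int]) -> dict[str, int]:
    """Return the GPUs each namespace is using beyond its own quota."""
    report = {}
    for namespace, used in usage.items():
        over = used - nominal_quota.get(namespace, 0)
        if over > 0:
            report[namespace] = over  # this much can be reclaimed by the lender
    return report


usage = {"training": 14, "inference": 12}
nominal_quota = {"training": 8, "inference": 16}
print(borrowed_gpus(usage, nominal_quota))  # {'training': 6}
```

A nonzero entry is not a problem by itself; it becomes one when the borrowing workload cannot tolerate losing those GPUs, so alerting on it alongside queue depth is a reasonable starting point.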

The CNCF survey data is unambiguous: 82% of container users now run Kubernetes in production, and AI workloads are the fastest-growing category. The tooling has caught up. The question is no longer whether Kubernetes can handle AI workloads — it is whether your team is running the stack that makes it efficient.
