In June 2026, NVIDIA merged two pull requests into its open-source Kubernetes AI scheduler that finally shipped hard, container-level GPU memory isolation, ending a decade in which multiple tenants sharing one accelerator could silently oversubscribe each other into OOM crashes. The KAI Scheduler now relies on HAMi-core, a CUDA interception library, to enforce caps that MIG cannot always provide and time-slicing never did. Here is how every method on the menu compares, and what each one actually costs you in latency, density, and blast radius.
The Idle-VRAM Problem Nobody Admits
A 7B-parameter model in FP16 needs roughly 14 GB of VRAM for its weights. An H100 PCIe ships with 80 GB. Run one model per GPU (the Kubernetes default) and 66 GB sits idle, paid for but untouched (Spheron). The waste compounds because inference does not behave like training. Training keeps streaming-multiprocessor (SM) occupancy above 90% because the forward pass, backward pass, and optimizer step fill every cycle. Inference does not: requests arrive in bursts, and even mid-burst the decode phase is memory-bandwidth bound, not compute bound, so the GPU delivers well under the peak TFLOPS the spec sheet advertises.
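The footprint arithmetic above can be checked with a short sketch. It counts weights only (no KV cache, activations, or runtime overhead) and treats 1 GB as 10^9 bytes; the function names are hypothetical, chosen for illustration.

```python
# Bytes per parameter for common weight formats (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[dtype]

def idle_vram_gb(gpu_vram_gb: float, params_billion: float, dtype: str) -> float:
    """VRAM left unused when a single model occupies a whole GPU."""
    return gpu_vram_gb - weight_footprint_gb(params_billion, dtype)

print(weight_footprint_gb(7, "fp16"))  # 14.0
print(idle_vram_gb(80, 7, "fp16"))     # 66.0
```

Real deployments will see somewhat less idle memory than this, because the KV cache grows with batch size and context length.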
The practical ceiling for most serving deployments is 20–40% average SM utilization (Spheron). You are paying for 100% of a GPU and using a fraction of it — a pattern we dissected in how the worst K8s GPU clusters waste 95% of capacity. That gap is the entire reason GPU sharing exists, and the reason the choice between time-slicing, MIG, MPS, HAMi-core, and FCSP is now a board-level cost question, not a dev convenience.
Time-Slicing: Cheap, Dangerous, Default
Native Kubernetes assigns whole GPUs. A pod requesting nvidia.com/gpu: 0.5 is simply rejected — the resource type is integer-only by design (vCluster). The lowest-friction escape hatch is the NVIDIA GPU Operator’s time-slicing, which carves a device into logical replicas via a ConfigMap:
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

Four pods now take turns on one GPU. The appeal is obvious: it works on every NVIDIA card, needs no reboot, and costs nothing. The cost is that time-slicing provides no memory isolation and no fault isolation (vCluster). Each replica sees the full VRAM through nvidia-smi. If one tenant allocates more than its share, the neighbours' allocations start failing with out-of-memory errors, and with no fault isolation a crash in one tenant can take the others' in-flight requests down with it. Context switching also adds non-trivial latency overhead as the GPU rapidly swaps between processes. Time-slicing is fine for a dev box or a low-traffic internal endpoint. It is a liability the moment two teams or two customers share silicon.
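To make the missing memory wall concrete, here is a toy model of a time-sliced GPU. It is a sketch under stated assumptions, not a driver simulation: `SharedGpu`, the tenant names, and the 81920 MiB pool size are all hypothetical, and a failed allocation stands in for a CUDA out-of-memory error.

```python
class SharedGpu:
    """Toy model of a time-sliced GPU: one physical VRAM pool, no per-replica cap."""

    def __init__(self, total_mib: int, replicas: int):
        self.total_mib = total_mib
        self.replicas = replicas
        self.used = {}  # tenant name -> MiB currently held

    def visible_mib(self, tenant: str) -> int:
        # Every replica sees the whole device, not total / replicas.
        return self.total_mib

    def free_mib(self) -> int:
        return self.total_mib - sum(self.used.values())

    def alloc(self, tenant: str, mib: int) -> bool:
        """Succeeds whenever the physical pool has room, regardless of fair share."""
        if mib > self.free_mib():
            return False  # real-world analogue: a CUDA out-of-memory error
        self.used[tenant] = self.used.get(tenant, 0) + mib
        return True

gpu = SharedGpu(total_mib=81920, replicas=4)
fair_share = gpu.total_mib // gpu.replicas     # 20480 MiB each, by convention only
greedy_ok = gpu.alloc("tenant-a", 70000)       # far above fair share, still succeeds
victim_ok = gpu.alloc("tenant-b", fair_share)  # within fair share, but fails
print(greedy_ok, victim_ok)                    # True False
```

The well-behaved tenant is the one that fails, which is exactly the property that makes uncapped sharing unsafe across team or customer boundaries.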
MIG: Hardware Isolation, Rigid Slices
Multi-Instance GPU (MIG) is the gold standard for hard isolation on the hardware itself. It partitions the GPU at the silicon level into up to seven independent instances, each with its own VRAM, cache, and compute engines. A crash or memory error in one slice physically cannot affect the others, and each slice appears as a separate CUDA device (Spheron). On an H100 80GB, the standard profiles include 1g.10gb (up to 7 instances, 10 GB each), 2g.20gb (up to 3 instances), 3g.40gb (up to 2 instances), and 4g.40gb (1 instance, since it takes four of the seven compute slices).
The cost math is striking. A 7B INT4 model needs only ~4 GB, so it fits comfortably in a 10 GB 1g.10gb slice. (INT4 is the same lever behind how quantization halved a 70B model’s inference cost.) That yields seven independent inference endpoints for the price of one H100: roughly $0.29/hr per slice against a single $2.01/hr H100, versus renting seven RTX 4090 instances at $0.51/hr each ($3.57/hr total) (Spheron). MIG’s rigidity is the trade-off: slices come in fixed sizes, re-partitioning requires the GPU to clear its workloads, and consumer GeForce RTX cards do not support it at all. MIG is right for multi-tenant production when your model sizes map cleanly onto fixed slices.
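The slice-fitting and per-slice pricing above reduce to a few lines of arithmetic. The sketch below uses three of the H100 profiles and the prices quoted in the text; the helper names are hypothetical, and the fit check compares weights only, so a real sizing exercise should leave headroom for the KV cache.

```python
# (profile name, memory in GB, max instances per H100 80GB), taken from the text.
MIG_PROFILES = [("1g.10gb", 10, 7), ("2g.20gb", 20, 3), ("3g.40gb", 40, 2)]

def smallest_fitting_profile(model_gb: float):
    """Return (name, instances) for the smallest listed profile that fits, or None."""
    for name, mem_gb, instances in MIG_PROFILES:  # list is sorted by slice size
        if model_gb <= mem_gb:
            return name, instances
    return None

def cost_per_slice(gpu_hourly_usd: float, instances: int) -> float:
    """Hourly GPU price split evenly across its slices, rounded to cents."""
    return round(gpu_hourly_usd / instances, 2)

name, instances = smallest_fitting_profile(4)            # ~4 GB 7B INT4 model
print(name, instances, cost_per_slice(2.01, instances))  # 1g.10gb 7 0.29
print(round(7 * 0.51, 2))                                # 3.57 (seven RTX 4090s)
```

A 14 GB FP16 model would land in a 2g.20gb slice instead, cutting the endpoint count from seven to three, which is the rigidity the paragraph describes.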
MPS: Concurrency Without Memory Walls
The Multi-Process Service (MPS) sits between time-slicing and MIG. It is a CUDA daemon that replaces the GPU’s default time-multiplexing with a single shared context, letting multiple client processes submit kernels that run concurrently rather than being interleaved sequentially. The payoff is lower kernel-launch overhead and genuine parallel execution across processes, which is why MPS is popular for controlled internal environments running several inference workers from the same team (Spheron).
But MPS shares a single VRAM pool. By default there is no per-process memory cap, so a runaway allocation in one client can starve the others. Newer CUDA releases add an opt-in per-client limit (CUDA_MPS_PINNED_DEVICE_MEM_LIMIT), but the clients still share one server process and there is no fault isolation between them. Isolation is process-level, not tenant-level. MPS buys you density and throughput on trusted workloads; it does not buy you the ability to put two untrusted customers on one GPU. Most teams reach for MPS when they trust all the tenants (one org, one product) and want the concurrency gains MIG’s fixed slices cannot give them.
HAMi-core: CUDA Interception Goes Mainstream
This is where June 2026 changed the landscape. NVIDIA’s KAI Scheduler — the Kubernetes-native scheduler born from the Run:ai engine NVIDIA acquired in late 2024 and open-sourced in April 2025 — has long offered fractional GPU sharing. But its isolation was strictly cooperative: the scheduler summed requested memory shares and refused to over-commit at the booking layer, yet it did nothing to physically stop a container from oversubscribing at runtime. A pod that requested 2000 MiB could still see and allocate the full GPU memory through the CUDA API and nvidia-smi (HAMi project). In dev that is tolerable. In multi-tenant production it is a fatal gap: tenants can OOM each other and you cannot precisely cap any single container.
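The gap between booking and enforcement is easy to see in a toy model. This is a sketch of the cooperative pattern described above, not KAI Scheduler code: `BookingLayer`, the pod names, and the usage figure are hypothetical.

```python
class BookingLayer:
    """Toy admission check: tracks requested memory, never observes real usage."""

    def __init__(self, gpu_total_mib: int):
        self.gpu_total_mib = gpu_total_mib
        self.booked = {}  # pod name -> requested MiB

    def admit(self, pod: str, request_mib: int) -> bool:
        if sum(self.booked.values()) + request_mib > self.gpu_total_mib:
            return False  # would over-commit on paper, so refuse
        self.booked[pod] = request_mib
        return True

booking = BookingLayer(gpu_total_mib=81920)
admitted = booking.admit("pod-a", 2000)

# Nothing at runtime ties actual usage to the booking.
actual_usage_mib = {"pod-a": 60000}
overrun = actual_usage_mib["pod-a"] - booking.booked["pod-a"]
print(admitted, overrun)  # True 58000
```

The ledger stays balanced while the device does not, which is why booking-layer accounting alone cannot protect a neighbouring tenant.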
The two merged PRs fix that by wiring in HAMi-core, the CUDA interception library from the CNCF Sandbox project HAMi. The mechanism is elegant and brutal in its simplicity. A MutatingWebhook injects the libvgpu.so library into every shared-GPU pod via /etc/ld.so.preload, along with a CUDA_DEVICE_MEMORY_LIMIT environment variable. Once the container starts, libvgpu.so intercepts every CUDA memory allocation call and enforces the cap at runtime, so a pod that requested 4096 MiB cannot allocate a 4097th; nvidia-smi inside the container reports only the slice it was granted (HAMi project). KAI Scheduler decides who runs where; HAMi-core guarantees that is all they get. Cloud-native GPU scheduling has, in the project’s own words, moved from the cooperative-sharing era into the hard-isolation era.
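The interception idea can be illustrated with a wrapper that sits between the caller and the allocator and keeps a running total. This is a Python analogy only: the real library hooks C-level CUDA calls through the dynamic loader, and `raw_alloc`, `make_capped_alloc`, and `OutOfMemory` are hypothetical names, not part of HAMi-core.

```python
class OutOfMemory(Exception):
    """Raised when an allocation would push a container past its cap."""

def raw_alloc(mib: int) -> int:
    """Stand-in for the underlying allocator; always succeeds in this sketch."""
    return mib

def make_capped_alloc(alloc, limit_mib: int):
    """Wrap an allocator so the running total can never exceed limit_mib."""
    state = {"used": 0}

    def capped(mib: int) -> int:
        if state["used"] + mib > limit_mib:
            raise OutOfMemory(
                f"limit {limit_mib} MiB, used {state['used']} MiB, asked {mib} MiB"
            )
        state["used"] += mib
        return alloc(mib)

    return capped

alloc = make_capped_alloc(raw_alloc, limit_mib=4096)
alloc(4096)      # exactly at the cap: allowed
try:
    alloc(1)     # one MiB over the cap: refused
    refused = False
except OutOfMemory:
    refused = True
print(refused)   # True
```

Because the check runs on every allocation, the cap holds even when the workload misbehaves, which is the difference from booking-layer accounting.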
FCSP: The Sub-Microsecond Contender
Interception has an overhead cost, and not every workload can afford it. BudEcosystem’s FCSP (Fixed Capacity Spatial Partition), detailed in a December 2025 paper, attacks the same problem from a different angle: a user-space virtualization framework that achieves sub-microsecond memory enforcement through lock-free data structures built on C11 atomics with cache-line-aligned layouts, replacing the semaphore-based synchronization older interceptors relied on. On the GPU-Virt-Bench suite, FCSP claims 1000× faster context creation (78 μs vs 84 ms for HAMi-core) and 3600× faster memory-limit enforcement (0.3 μs vs 1.1 ms), while delivering roughly 3× better multi-tenant isolation and 2× higher tenant density at under 5% performance degradation (BudEcosystem).
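The headline multipliers follow directly from the quoted latencies, which is worth checking before repeating them. The sketch below only redoes the division on the vendor-reported figures; it measures nothing itself.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

context_creation = speedup(84e-3, 78e-6)     # 84 ms vs 78 microseconds
limit_enforcement = speedup(1.1e-3, 0.3e-6)  # 1.1 ms vs 0.3 microseconds
print(round(context_creation), round(limit_enforcement))
```

The ratios come out at roughly 1,077× and 3,667×, so the rounded "1000×" and "3600×" claims are consistent with the underlying numbers.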
Those numbers are vendor-reported and need independent reproduction before they justify a migration, but the direction matters: the isolation layer is becoming a performance battleground, not just a correctness checkbox. If FCSP’s latency claims hold up under third-party benchmarking, the cost of hard isolation drops to the point where refusing to share GPUs stops being defensible on performance grounds.
Which Method Actually Fits Production?
The decision is not about finding the best technique in the abstract — it is about matching isolation strength to trust boundaries and workload shape. The table below collapses the trade-offs platform teams actually weigh:
| Method | Isolation | Memory model | Best fit |
|---|---|---|---|
| Time-slicing | None | Shared, uncapped | Dev / single-tenant test |
| MPS | Process-level | Shared, uncapped | Trusted tenants, same org |
| MIG | Hardware | Fixed slices | Multi-tenant when sizes are predictable |
| HAMi-core (in KAI) | Container hard cap | Intercepted, capped | Multi-tenant K8s production |
| FCSP | User-space hard cap | Lock-free enforcement | Latency-sensitive high density |
The pragmatic path most production teams converge on: use MIG where your model footprint is stable and fits the fixed slices, and reach for HAMi-core via KAI Scheduler for the long tail of variable-size workloads that MIG’s rigidity punishes. Pair either with continuous batching to claw back the GPU idle time that no isolation method fixes on its own. Time-slicing belongs behind a firewall, never on a customer-facing endpoint. The June 2026 KAI Scheduler integration matters precisely because it removes the last excuse for running cooperative sharing in production — the hard-isolation tooling is now first-party, scheduled, and free. The question is no longer whether to isolate. It is which isolation tax you are willing to pay per token.
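The table and the paragraph above can be condensed into a small decision helper. This is one possible reading of those trade-offs, not an official policy: the function and parameter names are hypothetical, and FCSP is left out because its numbers are still vendor-reported.

```python
def pick_sharing_method(environment: str, tenants_trusted: bool,
                        footprint_fits_mig_slice: bool, gpu_supports_mig: bool) -> str:
    """Map trust boundary and workload shape to a GPU sharing method."""
    if environment == "dev":
        return "time-slicing"            # no isolation, acceptable off the customer path
    if tenants_trusted:
        return "MPS"                     # concurrency for one org, one product
    if gpu_supports_mig and footprint_fits_mig_slice:
        return "MIG"                     # hardware isolation, fixed slice sizes
    return "HAMi-core via KAI Scheduler" # hard caps for variable-size workloads

print(pick_sharing_method("prod", False, True, True))   # MIG
print(pick_sharing_method("prod", False, False, True))  # HAMi-core via KAI Scheduler
```

The ordering of the checks encodes the priority: trust boundary first, then whether the footprint maps onto a fixed slice.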
References
- HAMi — HAMi-core Adopted by NVIDIA KAI Scheduler: GPU Sharing Enters the Hard-Isolation Era (June 2026)
- Spheron — Fractional GPUs for AI Inference: vGPU, MPS, and Right-Sizing (2026)
- vCluster — DIY GPU Sharing in Kubernetes: Time-Slicing, MIG & Workarounds (2026)
- BudEcosystem — FCSP: GPU Resource Isolation Framework for Multi-Tenant ML Workloads (Dec 2025)
- HAMi Project — Heterogeneous GPU Sharing on Kubernetes (CNCF Sandbox)