On May 6, 2026, NVIDIA donated Multipath Reliable Connection (MRC) to the Open Compute Project, turning a closed Spectrum-X optimization into an open RDMA transport. That act ended the decade-long pretence that AI cluster networking was settled. Today three open transport protocols — UEC’s UET, Google’s Falcon, and NVIDIA’s MRC — are racing to replace RoCEv2, and the choice locks a fabric architecture into your data center for the next three to five years.
The Fabric Now Outranks GPUs
Distributed training is synchronized by design. Collectives like AllReduce, AllGather and ReduceScatter must complete across every GPU in the cluster before the next iteration begins. The slowest packet, not the average packet, sets the pace of the run. As Mark Handley, Brad Karp and Costin Raiciu argued at a recent UCL workshop, this turns the interconnect into the rate-limiting variable once you scale past a few hundred GPUs.
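A minimal sketch of why the tail matters more than the mean. The GPU count and microsecond values are made-up illustrations, not measurements: one delayed participant out of 1,024 barely moves the average, yet it sets the step time for everyone.

```python
def collective_finish_us(arrival_us):
    """A synchronized collective completes when its slowest participant does."""
    return max(arrival_us)


def mean_arrival_us(arrival_us):
    """Average completion time across participants, for comparison only."""
    return sum(arrival_us) / len(arrival_us)


# Hypothetical numbers: 1,023 GPUs finish in 100 us, one is delayed to 400 us.
arrivals = [100.0] * 1023 + [400.0]

step_us = collective_finish_us(arrivals)
avg_us = mean_arrival_us(arrivals)

print(f"mean arrival: {avg_us:.1f} us, step time: {step_us:.1f} us")
```

The mean stays close to 100 µs while the step takes 400 µs, which is why tail latency is the metric to ask vendors about.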
The numbers are not subtle. A large-scale H100 cluster can spend 15–30% of its cycles idle, waiting on the network during big all-reduce operations, according to a 2026 decision guide from Spheron’s engineering team. Inflect’s cluster-sizing analysis is even more direct: a 1,024-GPU, 30-day pre-training run priced at roughly $2.2M in compute loses about $110,000 plus 1.5 schedule days to a modest 5% fabric-induced inefficiency. The GPU is no longer the bottleneck. The transport is.
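The Inflect arithmetic is easy to reproduce. The sketch below assumes a simple linear model (loss equals inefficiency times cost and schedule); the function name is hypothetical, and applying the 15% and 30% figures to the same run is an extrapolation for illustration, not something either source computes.

```python
def fabric_loss(compute_cost_usd, run_days, inefficiency):
    """Return (dollars lost, schedule days lost) under a linear model."""
    return compute_cost_usd * inefficiency, run_days * inefficiency


RUN_COST_USD = 2_200_000  # the 1,024-GPU, 30-day run quoted above
RUN_DAYS = 30

# 5% is the quoted scenario; 15% and 30% are illustrative extrapolations.
for frac in (0.05, 0.15, 0.30):
    usd, days = fabric_loss(RUN_COST_USD, RUN_DAYS, frac)
    print(f"{frac:.0%} inefficiency: ${usd:,.0f} and {days:.1f} schedule days")
```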
That economic reality is why the standards war matters. The three contenders below are not academic — they ship in silicon you are about to buy.
RoCEv2’s Brittleness Problem
RoCEv2 (RDMA over Converged Ethernet) is the incumbent in nearly every Ethernet-based AI cluster built before 2026. Meta, for example, runs tens of thousands of GPUs on a purpose-built RoCEv2 backend fabric, encapsulating RDMA inside UDP packets and forcing lossless behaviour with Priority Flow Control (PFC). Their SIGCOMM 2024 paper documents the architecture in detail — a two-stage Clos “AI Zone” with separate frontend and backend fabrics.
The problem is that RoCE was designed for HPC, not for synchronized collectives at 400–800 Gbps. It assumes a lossless Ethernet built on PFC, and it treats packet reordering as loss. Under multipath load balancing — which every modern fat-tree wants — that assumption breaks. Congestion control schemes like DCQCN react too slowly for the sub-millisecond bursts collectives generate. The result is the tail-latency stall that quietly eats 15–30% of your throughput. RoCEv2 can match InfiniBand on bandwidth on paper; in production at scale, the configuration surface is where deployments die.
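To see why "reordering treated as loss" hurts under multipath, here is a deliberately simplified receiver model. It is a sketch of strict in-order acceptance, not the behaviour of any particular NIC, and the arrival order is invented: no packet is lost in the network, yet the in-order receiver throws most of them away.

```python
def strict_in_order_discards(arrival_order):
    """Count packets a strict in-order receiver discards on first arrival.

    Any packet whose sequence number is not the next expected one is
    treated as a sign of loss and dropped, forcing a retransmission.
    """
    expected = 0
    discarded = 0
    for psn in arrival_order:
        if psn == expected:
            expected += 1
        else:
            discarded += 1
    return discarded


def reorder_tolerant_discards(arrival_order):
    """Count packets a reorder-tolerant receiver discards (duplicates only)."""
    seen = set()
    duplicates = 0
    for psn in arrival_order:
        if psn in seen:
            duplicates += 1
        else:
            seen.add(psn)
    return duplicates


# Hypothetical arrival order after multipath forwarding: nothing is lost,
# packets 1/2 and 4/5 are merely swapped.
arrivals = [0, 2, 1, 3, 5, 4]

print("strict in-order discards:", strict_in_order_discards(arrivals))
print("reorder-tolerant discards:", reorder_tolerant_discards(arrivals))
```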
This is the gap UET, Falcon and MRC are built to close.
UET: Packet Spraying And Switch Trims
The Ultra Ethernet Consortium (UEC), now roughly 200 members strong under the Linux Foundation, publishes UET (Ultra Ethernet Transport). The Ultra Ethernet Specification v1.0 (June 2025) is the canonical reference. UET’s defining bet is packet spraying: successive packets of the same flow are spread across different paths through the fabric, steered by entropy in the UDP source port. This maximizes link utilization but forces the receiver to tolerate reordering, which UET handles in silicon.
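A rough sketch of the difference between flow-based ECMP and spraying. It assumes a switch picks a path by hashing the header fields; the CRC32 hash, the addresses and the source-port range are stand-ins chosen for illustration, not what any real ASIC or the UET specification uses.

```python
import zlib


def ecmp_path(src_ip, dst_ip, src_port, dst_port, n_paths):
    """Pick an uplink by hashing header fields (CRC32 is a stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|udp".encode()
    return zlib.crc32(key) % n_paths


N_PATHS = 8
SRC, DST = "10.0.0.1", "10.0.1.1"  # hypothetical endpoints
DST_PORT = 4791                    # RoCEv2's UDP port, used only as a familiar example

# Flow-based ECMP: the source port is fixed, so every packet hashes the same way.
flow_paths = [ecmp_path(SRC, DST, 49152, DST_PORT, N_PATHS) for _ in range(64)]

# Spraying: the sender varies the source port per packet to change the hash input.
sprayed_paths = [ecmp_path(SRC, DST, 49152 + i, DST_PORT, N_PATHS) for i in range(64)]

print("distinct paths, fixed source port:", len(set(flow_paths)))
print("distinct paths, varied source port:", len(set(sprayed_paths)))
```

With a fixed source port the flow is pinned to one uplink regardless of how many exist; varying the port gives the hash a chance to use the others, at the cost of out-of-order arrival.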
Three mechanisms make spraying survivable at scale:
- Packet trimming and NACK. Instead of dropping, congested switches trim packets to a header to signal loss. Receivers NACK the missing sequence immediately, triggering fast retransmit.
- NSCC congestion control. A hybrid scheme that consumes both ECN markings and queueing delay, reacting faster and more fairly than DCQCN across synchronized incasts.
- Four reliability modes. Reliable ordered, reliable unordered, reliable unordered idempotent, and unreliable unordered — letting collectives pick the cheapest semantics that work.
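The trimming-and-NACK idea from the first bullet can be sketched as follows. This is a toy model under stated assumptions: the class and function names are hypothetical, the queue capacity is arbitrary, and real switches and the UET wire format are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    psn: int               # packet sequence number
    payload: bytes
    trimmed: bool = False  # True when only the header survived


class TrimmingSwitchPort:
    """Toy egress port: when the data queue is full, keep the header only."""

    def __init__(self, data_capacity):
        self.data_capacity = data_capacity
        self.data_queue = []
        self.header_queue = []  # trimmed headers are small, so they are always kept

    def enqueue(self, pkt):
        if len(self.data_queue) < self.data_capacity:
            self.data_queue.append(pkt)
        else:
            # Congested: drop the payload but forward the header as a loss signal.
            self.header_queue.append(Packet(pkt.psn, b"", trimmed=True))

    def drain(self):
        out = self.header_queue + self.data_queue
        self.header_queue, self.data_queue = [], []
        return out


def receive(packets):
    """Deliver full packets and NACK every sequence number that arrived trimmed."""
    delivered, nacks = [], []
    for pkt in packets:
        (nacks if pkt.trimmed else delivered).append(pkt.psn)
    return sorted(delivered), sorted(nacks)


port = TrimmingSwitchPort(data_capacity=3)
for psn in range(5):
    port.enqueue(Packet(psn, b"x" * 4096))

delivered, nacks = receive(port.drain())
print("delivered:", delivered, "NACKed for retransmit:", nacks)
```

The point of the design is that the receiver learns exactly which sequence numbers to request without waiting for a timeout.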
UET assumes a best-effort fat-tree with inevitable asymmetries, small incasts and aggressive multipath use. It is exposed to collective libraries (NCCL, MPI, vendor CCLs) via Libfabric. The catch: UET requires switch support for trimming, CSIG-style congestion signaling and link-layer retransmit. That is a non-trivial ask, and it is why the UEC membership roster matters more than the spec itself.
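To make the "ECN plus delay" idea behind NSCC more concrete, here is an illustrative window update that reacts to both signals. It is not the NSCC algorithm from the specification: the function name, the four-way split, and every constant (MTU, halving, the 0.5 scaling factor) are assumptions chosen only to show how two signals can be combined.

```python
def hybrid_cwnd_update(cwnd, ecn_marked, delay_us, target_delay_us,
                       mtu=4096, min_cwnd=4096):
    """Return a new congestion window (bytes) from one ACK's ECN and delay signals."""
    high_delay = delay_us > target_delay_us

    if not ecn_marked and not high_delay:
        # Both signals are clear: probe for more bandwidth.
        return cwnd + mtu
    if ecn_marked and high_delay:
        # Both signals agree on congestion: back off hard.
        return max(min_cwnd, cwnd // 2)
    if high_delay:
        # Delay only: back off in proportion to how far delay exceeds the target.
        excess = (delay_us - target_delay_us) / delay_us
        return max(min_cwnd, int(cwnd * (1 - 0.5 * excess)))
    # ECN only, delay still low: hold steady rather than overreact.
    return cwnd


cwnd = 65536
print("clear      ->", hybrid_cwnd_update(cwnd, False, 5, 10))
print("ECN+delay  ->", hybrid_cwnd_update(cwnd, True, 20, 10))
print("delay only ->", hybrid_cwnd_update(cwnd, False, 20, 10))
print("ECN only   ->", hybrid_cwnd_update(cwnd, True, 5, 10))
```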
Falcon: Multipath Subflows, No Special Silicon
Google’s Falcon transport protocol, also hosted at the OCP, takes the opposite philosophical stance: make Ethernet good enough without requiring switches to do anything new. Where UET sprays at the packet level, Falcon uses multipath subflows — think MPTCP, but in hardware. A single connection can spawn k subflows, each with independent congestion control.
Falcon’s headline properties, as summarized in the Midokura technical comparison: roughly 3 µs one-way latency, 100,000 connections per host, delay-based congestion control that avoids PFC entirely, and Tail Loss Probes (RACK-TLP) for fast recovery of missing packets. Because it rides standard Ethernet with no trimming and no link-layer retransmit, it fits cloud providers running multi-tenant, oversubscribed fabrics where AI is one workload among many — storage, RPC, search.
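A minimal sketch of the subflow idea, assuming each subflow keeps its own delay-driven window and the sender places each packet on the subflow with the most headroom. The class names, the additive-increase and 0.8 decrease constants, and the scheduling rule are all hypothetical; Falcon's actual algorithms are defined in its specification and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Subflow:
    path_id: int
    cwnd: float = 8.0  # window in packets (illustrative starting value)
    inflight: int = 0

    def on_ack(self, rtt_us, target_rtt_us):
        """Delay-based update: grow gently below target, shrink above it."""
        self.inflight = max(0, self.inflight - 1)
        if rtt_us <= target_rtt_us:
            self.cwnd += 1.0 / self.cwnd
        else:
            self.cwnd = max(1.0, self.cwnd * 0.8)


class Connection:
    """One logical connection backed by k independently controlled subflows."""

    def __init__(self, k):
        self.subflows = [Subflow(i) for i in range(k)]

    def pick_subflow(self):
        """Send on the subflow with the most free window, or None if all are full."""
        candidates = [s for s in self.subflows if s.inflight < int(s.cwnd)]
        if not candidates:
            return None
        best = max(candidates, key=lambda s: s.cwnd - s.inflight)
        best.inflight += 1
        return best


conn = Connection(k=2)

# Subflow 1 repeatedly sees RTTs above target, so only its window shrinks.
for _ in range(5):
    conn.subflows[1].on_ack(rtt_us=30, target_rtt_us=10)

for _ in range(8):
    conn.pick_subflow()

sent = [s.inflight for s in conn.subflows]
print("packets placed per subflow:", sent)
```

Because each subflow reacts only to its own path, congestion on one path shifts traffic to the others without any help from the switches.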
The trade-off: Falcon was not designed first for symmetric, dedicated AI fat-trees. It is a general-purpose low-latency transport that happens to be very good for collectives. For pure AI-factory deployments that is a compromise; for hyperscalers with mixed workloads it is the whole point.
MRC: NVIDIA’s Just-Open-Sourced Bet
The newest entrant is NVIDIA’s MRC, open-sourced through OCP on May 6, 2026. According to NVIDIA’s Gilad Shainer, MRC lets a single RDMA connection spread traffic across multiple network paths and dynamically steer around congestion in hardware. Its signature is failure bypass: path failure is detected and traffic rerouted in microseconds, entirely in silicon, without involving the host stack. For clusters where thousands of GPUs must stay synchronized across a multi-week run, that is the difference between a clean checkpoint and a 12-hour rollback.
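The failure-bypass behaviour can be pictured with a small model: one connection, several paths, and a failed path removed from rotation so that later packets never touch it. This is a conceptual sketch only. The class and method names are hypothetical, the round-robin policy is an assumption, and in MRC the equivalent logic runs in NIC hardware rather than in software.

```python
class MultipathConnection:
    """One logical connection that rotates packets across healthy paths."""

    def __init__(self, path_ids):
        self.healthy = list(path_ids)
        self._next = 0

    def mark_failed(self, path_id):
        """Remove a path from rotation once a failure is detected on it."""
        if path_id in self.healthy:
            self.healthy.remove(path_id)

    def next_path(self):
        """Return the path for the next packet, skipping failed paths."""
        if not self.healthy:
            raise RuntimeError("no healthy paths left")
        path = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return path


conn = MultipathConnection([0, 1, 2, 3])
before = [conn.next_path() for _ in range(4)]

conn.mark_failed(2)  # hypothetical link failure on path 2
after = [conn.next_path() for _ in range(6)]

print("before failure:", before)
print("after failure: ", after)
```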
MRC is already in production at serious scale. OpenAI runs it on Blackwell-generation hardware (“MRC’s end-to-end approach enabled us to avoid much of the typical network-related slowdowns,” said Sachin Katti). Microsoft’s Fairwater and Oracle’s OCI Abilene — two of the largest purpose-built AI factories — both rely on MRC. Crucially, MRC coexists with Spectrum-X Adaptive RDMA and other custom protocols on ConnectX SuperNICs and Spectrum-X switches, and it pairs with multiplanar network designs where multiple independent fabrics each provide alternate GPU-to-GPU paths.
The strategic move here is subtle. By donating MRC to OCP while optimizing it first for Spectrum-X, NVIDIA is doing what it did with CUDA years ago: opening the interface while owning the silicon that runs it fastest. Whether the industry reads that as generosity or vendor lock-in depends on whether non-Spectrum-X switches can hit equivalent performance.
The Cost Math That Decides It
None of this is procurement theatre. The transport you choose is a multi-year commitment, and the wrong pick compounds silently. The Inflect model (5% fabric inefficiency on a $2.2M run) is the floor, not the ceiling. Spheron’s 15–30% idle-cycle figure is the realistic mid-range once you account for tail latency under real contention.
| Protocol | Multipath strategy | Switch requirements | Best deployment fit |
|---|---|---|---|
| RoCEv2 | ECMP (flow-based) | PFC + DCQCN tuning | Incumbent; works but fragile at scale |
| UET | Packet spraying | Trim, CSIG, link-layer retransmit | Purpose-built AI fat-trees |
| Falcon | Multipath subflows | None (standard Ethernet) | Multi-tenant cloud, mixed workloads |
| MRC | Per-connection multipath with hardware rerouting | Optimized first for Spectrum-X | Gigascale AI factories (OpenAI, MSFT, OCI) |
InfiniBand NDR/XDR with SHARP still wins on raw all-reduce latency for dedicated training clusters — SHARP moves the gradient reduction into switch silicon, collapsing O(log N) round-trips toward O(1). But InfiniBand is a single-vendor stack, and the whole point of UET/Falcon/MRC is to give Ethernet an answer that does not require buying Quantum-2 switches.
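The O(log N) versus O(1) claim can be made concrete with a simplified step count. The model below is an assumption for illustration: it compares a binary-tree all-reduce done by the hosts (reduce up, then broadcast down) with in-network reduction as seen from a host (send once, receive once), and it ignores bandwidth, message size and the switch tree's own depth.

```python
def tree_allreduce_steps(n_gpus):
    """Sequential host steps for a binary-tree reduce followed by a broadcast."""
    depth = (n_gpus - 1).bit_length()  # ceil(log2(n_gpus)) for n_gpus >= 1
    return 2 * depth


def in_network_allreduce_steps(n_gpus):
    """Host-visible steps when switches do the reduction: send up, receive result."""
    return 2


for n in (8, 1024, 16384):
    print(f"{n:>6} GPUs: host tree = {tree_allreduce_steps(n):>2} steps, "
          f"in-network = {in_network_allreduce_steps(n)} steps")
```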
What To Verify Before You Commit
Before signing a cluster deal in 2026, pressure-test the transport layer directly:
- Which transport ships on the NIC? UET requires new switch ASICs. Falcon runs on standard Ethernet switches but needs Falcon-capable NICs in the host. MRC is real today on Spectrum-X. Ask for the BOM, not the roadmap.
- What does the failure-bypass latency actually measure? MRC advertises microsecond reroute. RoCEv2 has no equivalent: a link failure can stall collectives for milliseconds, which on a synchronized job can end in a checkpoint rollback.
- How is congestion signaled? UET mixes ECN and delay (NSCC). Falcon is delay-based with RACK-TLP. RoCEv2 leans on DCQCN, which is the part most likely to be misconfigured. Get the tuning matrix in writing.
- Is the fabric symmetric? UET’s packet spraying works best on a reasonably symmetric fat-tree. On a heavily asymmetric or oversubscribed topology, common in shared clouds, spraying can amplify incast rather than relieve it.
- What is the multiplanar story? NVIDIA Spectrum-X multiplane and OpenAI’s deployment both rely on independent fabric planes for resiliency. If your provider cannot articulate how their transport maps to multiple planes, you are buying a single point of failure at gigascale.
References
- NVIDIA — Spectrum-X Ethernet Sets the Standard for Gigascale AI, Now With MRC (May 6, 2026)
- Midokura — Hardware Transports for AI Networking: UET vs Falcon and Beyond
- Ultra Ethernet Consortium — UET Specification v1.0 (June 11, 2025)
- OCP — Falcon Transport Protocol Specification (opencomputeproject/OCP-NET-Falcon)
- Meta Engineering — RoCE Networks for Distributed AI Training at Scale (SIGCOMM 2024)
- Spheron — GPU Networking: InfiniBand vs RoCE vs Spectrum-X Decision Guide (2026)
- Inflect — GPU Cluster Networking: InfiniBand vs. RoCE for Large-Scale AI Training