NVIDIA Paid $20B for Groq. The Inference Chip Gap Is Real

On December 24, 2025, NVIDIA paid $20 billion for a non-exclusive license to Groq’s Language Processing Unit (LPU) architecture and hired Groq’s CEO Jonathan Ross plus core engineering talent. The deal signals that even NVIDIA, with ~80% of AI training silicon, recognizes GPUs are architecturally wrong for inference decode — the phase where specialized silicon already delivers 5–10x higher throughput at lower per-token cost (TECHi, BraivIQ).

The Memory Wall Is the Bottleneck

LLM inference runs in two phases: prefill (processing the prompt in parallel, which is compute-bound) and decode (generating tokens one at a time, which is memory-bound). During decode, the model must read all of its weights from memory for every single token generated. A 70B-parameter model in FP8 (one byte per parameter) requires roughly 70 GB of weights read per token. On a conventional GPU, those weights live in off-chip High Bandwidth Memory (HBM), and the processor spends most of its time waiting for data to arrive from that memory (Sesame Disk).

This is the “memory wall.” An H100 delivers about 3.35 TB/s of HBM3 bandwidth. If your 70B FP8 model needs 70 GB read per token, the theoretical ceiling for a single unbatched stream is roughly 48 tokens/sec per GPU (3,350 GB/s divided by 70 GB), before accounting for KV cache, attention computation, and batching overhead. In practice, commodity H100 endpoints serve 100–150 tokens/sec in aggregate across a batched workload, because one pass over the weights serves every request in the batch. An individual user’s stream still feels slow at 30–50 tokens/sec when the GPU is shared (Digital Applied).
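
The ceiling above is a one-line division, and it can help to keep it as a small function so other bandwidth and model-size figures can be plugged in. This is a minimal sketch: the function name is illustrative, and it assumes every decoded token requires exactly one full read of the weights, ignoring KV cache traffic and batching.

```python
def decode_ceiling_tokens_per_sec(params_billion: float,
                                  bytes_per_param: float,
                                  bandwidth_tb_per_sec: float) -> float:
    """Upper bound on single-stream decode rate when each token
    needs one full pass over the model weights."""
    weight_gb = params_billion * bytes_per_param        # 1e9 params * bytes = GB
    bandwidth_gb_per_sec = bandwidth_tb_per_sec * 1000.0
    return bandwidth_gb_per_sec / weight_gb


# Figures from the text: 70B parameters, FP8 (1 byte each), 3.35 TB/s HBM3.
h100_ceiling = decode_ceiling_tokens_per_sec(70, 1.0, 3.35)
print(f"H100 single-stream ceiling: {h100_ceiling:.1f} tok/s")
```

Swapping in a different bandwidth figure shows how directly the ceiling tracks memory bandwidth rather than compute.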

The bottleneck is not FLOPS. NVIDIA’s Blackwell ships 2,250 TFLOPS of FP8 — more compute than any inference workload can consume at the decode phase. The problem is getting weights to the compute units fast enough. This is the gap specialized silicon attacks.
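
To see how lopsided the two costs are, the sketch below compares the time to read the weights once against the time to do the arithmetic for one token. It assumes the common rule of thumb of about 2 FLOPs per parameter per generated token, which is not stated in the article. It also pairs the two figures quoted above (3.35 TB/s and 2,250 TFLOPS) purely for illustration, even though they describe different GPU generations. The function name is hypothetical.

```python
def per_token_times_ms(params_billion: float,
                       bytes_per_param: float,
                       bandwidth_tb_per_sec: float,
                       peak_tflops: float) -> tuple[float, float]:
    """Return (memory_ms, compute_ms) for one decoded token at batch size 1."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Assumption: roughly 2 FLOPs (multiply + add) per parameter per token.
    flops_per_token = 2.0 * params_billion * 1e9
    memory_ms = weight_bytes / (bandwidth_tb_per_sec * 1e12) * 1e3
    compute_ms = flops_per_token / (peak_tflops * 1e12) * 1e3
    return memory_ms, compute_ms


memory_ms, compute_ms = per_token_times_ms(70, 1.0, 3.35, 2250)
print(f"weight read: {memory_ms:.2f} ms, arithmetic: {compute_ms:.3f} ms")
print(f"memory time is about {memory_ms / compute_ms:.0f}x the compute time")
```

Under these assumptions the weight read takes hundreds of times longer than the math, which is the sense in which decode is memory-bound.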

Why Groq’s LPU Wins at Decode

Groq’s Language Processing Unit makes three architectural decisions that flip the GPU model upside down. First, it uses on-chip SRAM as primary working storage instead of off-chip HBM. The Groq 3 LPU, unveiled at GTC 2026, delivers 150 TB/s of on-chip SRAM bandwidth — roughly 7x faster than the HBM on NVIDIA’s next-generation Vera Rubin GPU. Weights stay next to compute units, eliminating the interconnect round-trip that throttles GPU decode (TECHi).

Second, the LPU uses deterministic execution. Every operation follows a predictable path through the chip. There are no cache misses, no speculative scheduling surprises, no tail-latency spikes at the 99th percentile. GPU inference latency varies because memory access patterns are unpredictable under concurrent batching. Groq’s latency is flat — sub-300ms time-to-first-token regardless of load (APIScout).
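
Tail latency is easy to check on your own traffic. The sketch below computes P50 and P99 from a list of time-to-first-token samples using Python's standard library. The two sample lists are synthetic placeholders chosen to show a flat profile and a spiky one. They are not measurements of any provider.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return the median and 99th-percentile latency of the samples."""
    # n=100 yields 99 cut points; index 49 is P50 and index 98 is P99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


# Synthetic data: a flat profile and one with occasional spikes.
flat_profile = [280.0] * 100
spiky_profile = [800.0] * 95 + [2000.0] * 5

flat = latency_percentiles(flat_profile)
spiky = latency_percentiles(spiky_profile)
print("flat :", flat)
print("spiky:", spiky)
```

The ratio of P99 to P50 is a simple jitter indicator: it stays at 1.0 for the flat profile and rises as spikes appear.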

Third, the chip is inference-only. It does not support training, gradient computation, or backward passes. This removes circuitry that GPUs dedicate to general-purpose flexibility, freeing die area for the memory hierarchy that matters for autoregressive generation. The trade-off is stark: you cannot fine-tune on an LPU, and you can only run models the compiler has been adapted to support.

The Benchmark Gap Is Not Marginal

The performance differential between specialty inference silicon and commodity GPUs is not a rounding error. On the metrics that determine user experience and unit economics, it ranges from roughly 3x to 7x (Digital Applied, APIScout).

| Workload (70B-class model) | Groq LPU | H100 / H200 GPU | Differential |
| --- | --- | --- | --- |
| Output throughput (tok/s) | 750 | 100–150 | 5–7x |
| Peak output (Llama 4 Maverick) | 1,200+ | 170–300 | 4–7x |
| Time-to-first-token (P50) | <300ms | 800ms–2s | 3–4x |
| Latency consistency (P99) | Deterministic | Variable (spikes) | No jitter |
| Throughput scaling under load | Linear | Degrades | Predictable |

These are not synthetic numbers. They come from independent benchmarks published by Artificial Analysis and direct production testing across seven serverless inference providers in Q2 2026. On Llama 4 70B output decode, Groq’s LPU hits 750 tokens/sec; Cerebras’s wafer-scale engine hits 600+. A typical H100 endpoint runs 100–150 (Digital Applied).

Cerebras: Wafer-Scale Alternative

Cerebras takes a different architectural path to the same problem. Its Wafer Scale Engine 3 (WSE-3) is an entire silicon wafer functioning as a single processor with 44 GB of on-chip SRAM, which avoids much of the multi-chip partitioning that fragments large models across GPU clusters. The WSE-3 achieves 600+ tokens/sec on 70B-class models by keeping weights resident in on-chip memory, avoiding the cross-GPU communication overhead that NVLink and InfiniBand exist to mitigate but cannot eliminate (BraivIQ). Note that 70 GB of FP8 weights exceeds the 44 GB on a single wafer, so a 70B-class model fits entirely on-chip only at lower precision or when split across more than one wafer.

The architectural philosophy differs from Groq. Groq optimizes for deterministic single-stream latency; Cerebras optimizes for sustained batch throughput on large models. Both attack the memory wall. Both sacrifice generality. Neither runs proprietary frontier models (no GPT, no Claude, no Gemini) — they serve open-weight models only, which is the single biggest constraint for teams that need frontier reasoning capabilities (APIScout).

What NVIDIA Actually Bought

The Groq deal structure reveals NVIDIA’s strategic calculus. Under the terms signed December 24, 2025, NVIDIA paid $20 billion for non-exclusive rights to Groq’s LPU architecture, about 2.9x Groq’s prior $6.9 billion valuation. NVIDIA hired Groq’s founder and CEO Jonathan Ross, president Sunny Madra, and the majority of Groq’s engineering team. Groq technically remains independent under new CEO Simon Edwards and retains full IP ownership, including the right to license to others (TECHi).

What NVIDIA gets is the Groq 3 LPU: Samsung 4nm process, 150 TB/s on-chip SRAM bandwidth, 315 PFLOPS FP8 per rack, and a claimed 35x throughput per megawatt versus Blackwell for trillion-parameter models. Shipping begins Q3 2026. Senators Warren and Blumenthal opened an investigation on March 20, 2026, arguing the deal is a reverse acqui-hire structured to evade Hart-Scott-Rodino antitrust filing. The DOJ and FTC are reviewing (TECHi).

The competitive fallout matters for procurement teams. If the 35x efficiency claim holds, AMD’s MI450 faces the highest threat. Google TPU and Amazon Inferentia are partially insulated by captive cloud ecosystems. The deal effectively gives NVIDIA a two-architecture strategy: GPUs for training and prefill, LPUs for decode — exactly the disaggregation pattern that leading inference engines are already implementing (TECHi).

Decision Matrix: When Specialty Silicon Wins

For platform teams making 2026–2028 compute capacity decisions, the inference hardware choice is no longer single-vendor. The same open-weight model varies by about 6x in price and 5–7x in latency across providers. The right architecture routes workloads by characteristics, not by vendor preference (Digital Applied).

  • Real-time chat and voice agents: Groq LPU or Cerebras. Sub-300ms TTFT is the difference between a product that feels instant and one users abandon. Premium pricing (2–3x Together’s per-token rate) is justified when latency SLA is the product.
  • Batch and background processing: Together AI or Fireworks on H100/H200 clusters. Cheapest per-token cost in the market ($0.65/1M for Llama 4 70B at batch tier vs $4.20/1M at the most expensive listed price). Latency does not matter when no user is waiting.
  • Frontier model serving: Native APIs (OpenAI, Anthropic, Google). Specialty silicon cannot serve proprietary models. If your workload requires GPT-5.2 or Claude reasoning, GPU infrastructure behind the API is your only path.
  • Regulated enterprise: Anyscale Endpoints. Built on Ray, ships HIPAA/SOC 2/EU data residency by default. Higher per-token rate (1.5–2x Together), but with a compliance posture smaller vendors cannot match.
  • Niche or custom fine-tunes: Replicate or OctoAI. They run almost any model via container when the big providers do not host it. Higher per-token cost, but the alternative is self-hosting for a single model.
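
The price spread in the batch bullet above is easier to reason about as a monthly bill. The sketch below multiplies a token volume by the two listed per-million prices. The one-billion-token monthly volume is a hypothetical figure chosen only to make the arithmetic concrete.

```python
def monthly_cost_usd(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Cost of a month's output tokens at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


# Hypothetical volume; the two prices are the ones quoted in the list above.
volume = 1_000_000_000
cheapest = monthly_cost_usd(volume, 0.65)
priciest = monthly_cost_usd(volume, 4.20)

print(f"batch tier:   ${cheapest:,.0f}/month")
print(f"highest list: ${priciest:,.0f}/month")
print(f"spread:       {priciest / cheapest:.1f}x")
```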

The practical pattern emerging in 2026 is multi-vendor routing: Groq for latency-critical paths, Together for steady-state volume, and native APIs for frontier reasoning — managed behind a single gateway that routes by workload class. For teams optimizing the GPU side of that stack, CUDA Graphs and torch.compile still deliver measurable decode speedups, and the economics of self-hosting via NVIDIA NIM remain favorable above specific volume thresholds.
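
A gateway that routes by workload class can start as a very small piece of logic. The sketch below is one possible shape, not a reference implementation. The `Request` fields, the `WorkloadClass` names, and the backend labels are all hypothetical, and the labels are plain strings rather than real endpoints or SDK calls.

```python
from dataclasses import dataclass
from enum import Enum


class WorkloadClass(Enum):
    REALTIME = "realtime"    # a user is waiting on the stream
    BATCH = "batch"          # background work, cost-sensitive
    FRONTIER = "frontier"    # needs a proprietary model


# Hypothetical backend labels mirroring the pattern described in the text.
ROUTES = {
    WorkloadClass.REALTIME: "groq",
    WorkloadClass.BATCH: "together",
    WorkloadClass.FRONTIER: "native-api",
}


@dataclass
class Request:
    needs_proprietary_model: bool
    user_is_waiting: bool


def classify(req: Request) -> WorkloadClass:
    # Model availability is a hard constraint, so it is checked first.
    if req.needs_proprietary_model:
        return WorkloadClass.FRONTIER
    if req.user_is_waiting:
        return WorkloadClass.REALTIME
    return WorkloadClass.BATCH


def route(req: Request) -> str:
    return ROUTES[classify(req)]


print(route(Request(needs_proprietary_model=False, user_is_waiting=True)))
print(route(Request(needs_proprietary_model=False, user_is_waiting=False)))
print(route(Request(needs_proprietary_model=True, user_is_waiting=True)))
```

Checking the proprietary-model requirement before latency reflects the constraint noted earlier: specialty silicon cannot serve those models regardless of how latency-sensitive the request is.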

What Changes If the Deal Survives Review

If NVIDIA integrates LPU technology into its inference stack, the competitive landscape compresses. A combined GPU+LPU platform from a single vendor eliminates the integration tax of routing across GroqCloud and NVIDIA endpoints separately. The disaggregation pattern — prefill on GPU, decode on LPU — becomes a first-class product feature rather than a custom engineering effort (TECHi).
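
The disaggregation pattern is simple to express structurally: one phase builds per-request state from the prompt, and a second phase extends that state one token at a time. The toy sketch below shows only that handoff. `KVCache`, `prefill`, `decode`, and `toy_model` are hypothetical stand-ins with no real model or hardware behind them. In a real deployment the two functions would run on different device pools, with the cache transferred between them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class KVCache:
    """Stand-in for the attention state handed from prefill to decode."""
    tokens: List[int] = field(default_factory=list)


def prefill(prompt_tokens: List[int]) -> KVCache:
    # Compute-bound phase: the whole prompt is processed in one parallel pass.
    return KVCache(tokens=list(prompt_tokens))


def decode(cache: KVCache,
           next_token: Callable[[List[int]], int],
           max_new_tokens: int,
           eos: int) -> List[int]:
    # Memory-bound phase: one token per step, each step extends the cache.
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(cache.tokens)
        if token == eos:
            break
        cache.tokens.append(token)
        generated.append(token)
    return generated


def toy_model(context: List[int]) -> int:
    # Placeholder rule, not a real model: previous token plus one, modulo 10.
    return (context[-1] + 1) % 10


cache = prefill([1, 2, 3])                                      # "GPU" side
generated = decode(cache, toy_model, max_new_tokens=4, eos=0)   # "LPU" side
print(generated)
```

The only thing crossing the boundary is the cache object, which is why the cost of moving that state between the two pools is the main engineering concern in disaggregated serving.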

But the regulatory outcome is uncertain. If the FTC blocks the deal or forces divestment, Groq continues independently with its IP and NVIDIA loses its fastest path to inference-specialized silicon. Either way, the architectural insight stands: autoregressive LLM decode is a memory-bandwidth problem, and the chip that solves memory wins the inference market. For teams who learned from the 99KB community fix that unlocked 5x MoE inference, the lesson is the same — the bottleneck is rarely where the marketing says it is.

References