The 99KB Problem: How a Community Fix Unlocked 5x Faster MoE Inference on Blackwell Workstation GPUs

Running a 397-billion-parameter model at 283 tokens per second on workstation hardware sounds impossible. Last week, a developer proved it’s not: they just had to rewrite part of NVIDIA’s kernel code first.

The fix, submitted as a pull request to FlashInfer, addresses a fundamental mismatch between NVIDIA’s datacenter and workstation Blackwell GPUs. If you’re running Mixture-of-Experts models on RTX PRO 6000, RTX 5090, or DGX Spark hardware, this matters directly to your throughput.

The Bottleneck Nobody Talked About

Here’s the issue: NVIDIA’s Blackwell architecture isn’t uniform across product lines.

The B200 datacenter GPU packs 228KB of shared memory per streaming multiprocessor. The SM120 workstation chips—RTX PRO 6000, RTX 5090, DGX Spark—have 99KB. Same architecture family, same marketing materials, very different memory budgets.

For most workloads, this doesn’t matter. But MoE models using NVFP4 quantization hit this wall hard.

Why MoE + NVFP4 Breaks on SM120

Mixture-of-Experts models route each token through a subset of “expert” networks rather than activating all parameters. Qwen3.5-397B, for example, has 397 billion total parameters but only activates 17 billion per forward pass. This makes massive models practical on smaller hardware—until the kernel optimization fails.

NVFP4 is NVIDIA’s 4-bit floating-point format with a two-level scaling system: micro-block scaling for groups of 16 values, plus per-tensor FP32 scaling. It’s designed to preserve accuracy at ultra-low precision while cutting memory usage by 3.5x versus FP16.

The problem: CUTLASS’s optimized GEMM tiles for MoE workloads were designed around K=128 shapes. These tiles require more than 99KB of shared memory at runtime. When they overflow, the GPU falls back to unoptimized kernels that crawl along at a fraction of theoretical throughput.
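To see why a K=128 tile can blow past 99KB while a K=64 tile fits, a back-of-envelope estimate helps. The tile shape, stage count, and the `tile_smem_kb` helper below are illustrative assumptions for this sketch, not CUTLASS's actual layout:

```python
# Rough shared-memory estimate for a multi-stage, block-scaled FP4 GEMM tile.
# All numbers here are illustrative assumptions, not CUTLASS's real layout.
def tile_smem_kb(m, n, k, stages, bits=4, sf_vec=16, sf_bits=8):
    operand_bytes = (m * k + n * k) * bits / 8            # packed FP4 A and B tiles
    scale_bytes = (m * k + n * k) / sf_vec * sf_bits / 8  # one FP8 scale per 16 values
    return stages * (operand_bytes + scale_bytes) / 1024  # pipeline stages multiply it

print(tile_smem_kb(128, 128, 128, 6))  # 108.0 KB: a K=128 tile overflows the 99KB budget
print(tile_smem_kb(128, 128, 64, 6))   # 54.0 KB: halving K brings it comfortably under
```

The point is the scaling, not the exact figures: shared-memory demand grows linearly in K and in the number of pipeline stages, so shrinking the K dimension is the natural lever when the budget is fixed at 99KB.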

One developer reported 55 tokens per second on WSL2 with a 4x RTX PRO 6000 setup. For a 397B model, that’s not just slow—it’s practically unusable for interactive applications.

Understanding NVFP4: Why 4-Bit Changes Everything

Before diving deeper into the fix, it helps to understand why NVFP4 matters for this specific problem.

Higher-precision formats like FP16 or even FP8 keep values in a relatively wide dynamic range. NVFP4 compresses to just 4 bits per weight using an E2M1 structure: 1 sign bit, 2 exponent bits, and 1 mantissa bit. Representable values range from -6 to +6.

The innovation isn’t the 4-bit representation itself—it’s the scaling strategy. NVIDIA uses a two-level approach:

Micro-block scaling: Every 16 consecutive values share an FP8 (E4M3) scaling factor. This provides fine-grained adaptation to local dynamic range variations within a tensor.

Per-tensor scaling: A global FP32 factor normalizes across the entire tensor, handling the wide range of values typical in neural network weights.

This dual-level approach is what allows NVFP4 to maintain accuracy within 1% of FP16 while using a quarter of the memory bandwidth. But it also means the scaling factors need to be loaded into shared memory alongside the compressed values—which is where the 99KB wall appears.
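A toy sketch of the two-level scheme makes the mechanics concrete. This is a simplification: it snaps values to the E2M1 magnitude grid and keeps both scales as Python floats, whereas real NVFP4 stores packed E2M1 codes and E4M3 block scales:

```python
# Toy NVFP4-style quantizer: global scale + one scale per 16-value micro-block.
# Simplified sketch; real NVFP4 packs E2M1 codes and uses E4M3 block scales.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |values|

def fake_quantize(values, block=16):
    """Quantize and immediately dequantize, returning the reconstructed values."""
    global_scale = (max(abs(v) for v in values) / 6.0) or 1.0  # map tensor max to 6
    out = []
    for i in range(0, len(values), block):
        chunk = [v / global_scale for v in values[i:i + block]]
        block_scale = (max(abs(v) for v in chunk) / 6.0) or 1.0  # per-block scale
        for v in chunk:
            mag = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(v) / block_scale - g))
            out.append((mag if v >= 0 else -mag) * block_scale * global_scale)
    return out
```

Values that land between grid points snap to the nearest representable magnitude; the per-block scale keeps that rounding error proportional to the local dynamic range rather than the tensor-wide one.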

The Fix: K=64 Tiles and a CUTLASS Patch

The solution came from recognizing that K=64 tile shapes could fit within SM120’s 99KB budget. But CUTLASS had a bug blocking this path.

The TMA (Tensor Memory Accelerator) scale factor layout assumed K≥128, creating a mismatch when K=64. Specifically, the `Blk_SF` parameter expected at least 4 scale factors along the K dimension, but K=64 only provides 2.

The patch modifies `sm120_blockscaled_mma_builder.inl` to:

1. Calculate an effective block size: `EffBlk_SF = min(K/SFVectorSize, Blk_SF)`

2. Fold excess scale factors into the basic block when they exceed MMA requirements

It’s a 20-line fix. The impact is dramatic.
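The clamp at the heart of the patch can be sketched in a few lines. The names follow the PR's description, and `sf_vector_size=32` is an assumption chosen so the scale-factor counts quoted above (4 expected, 2 provided at K=64) come out; the real CUTLASS code operates on C++ layout types, not integers:

```python
# Sketch of the patch's effective-block-size clamp (hypothetical Python
# rendering of the CUTLASS change; sf_vector_size=32 is an assumption
# chosen to reproduce the scale-factor counts quoted in the article).
def effective_blk_sf(k, sf_vector_size=32, blk_sf=4):
    provided = k // sf_vector_size  # scale factors the tile actually provides along K
    return min(provided, blk_sf)    # clamp: never assume more than exist

print(effective_blk_sf(128))  # 4: K=128 provides enough, layout unchanged
print(effective_blk_sf(64))   # 2: K=64 provides only 2, so the block shrinks
```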

Benchmark Results: From 55 to 283 tok/s

Testing on 4x RTX PRO 6000 Blackwell (96GB GDDR7 each, SM 12.0) with Qwen3.5-397B-A17B-NVFP4:

| Configuration | Single-user tok/s | Cumulative Improvement |
|---------------|-------------------|------------------------|
| WSL2 baseline | 55 | — |
| Native Linux | 119 | +116% |
| + MTP=5 + config tuning | 134 | +13% |
| + Driver 595 + CUDA 13.2 | 142 | +6% |
| + K=64 kernel | 283 | +99% |

The full journey from 55 to 283 tok/s represents a 5x improvement. Each step matters, but the kernel fix delivers the single largest gain.

Multi-user throughput gains are even more pronounced:

| Concurrent Users | Before (tok/s) | After (tok/s) | Improvement |
|------------------|----------------|---------------|-------------|
| 1 | 142 | 283 | +99% |
| 4 | 250 | 850 | +240% |
| 8 | 510 | 1,283 | +151% |

At 8 concurrent users, the optimized setup pushes past 1,200 tokens per second, roughly 77,000 tokens per minute. For context, that’s enough to generate a complete technical article in under 10 seconds.

The fix has been submitted to FlashInfer as PR #2786 and is available immediately via a pre-built Docker image.

Implementation: Getting 283 tok/s Yourself

Option 1: Pre-built Docker Image (Recommended)

The fastest path to optimized performance:

```bash
docker pull verdictai/vllm-blackwell-k64:latest

docker run -d --name vllm --gpus all --ipc host --shm-size 32g \
  -p 9200:8000 \
  -v /path/to/qwen35-nvfp4:/model:ro \
  -e NCCL_P2P_DISABLE=1 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  verdictai/vllm-blackwell-k64:latest \
  python3 -m vllm.entrypoints.openai.api_server \
    --model /model --served-model-name qwen3.5-397b-nvfp4 \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --tensor-parallel-size 4 --gpu-memory-utilization 0.85 \
    --max-model-len 262144 --enable-prefix-caching \
    --reasoning-parser qwen3 --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
```

Critical Configuration Details

NCCL_P2P_DISABLE=1: AMD-Vi IOMMU can cause page faults with GPU peer-to-peer transfers on Threadripper systems. If you want to try P2P, add `iommu=pt` to kernel boot parameters instead.

Driver 595: Install from NVIDIA’s CUDA repository, not your distro’s package manager. The jump from 580/590 to 595 brings meaningful SM120 improvements.

```bash
sudo apt install nvidia-open
```

MTP (Multi-Token Prediction): Use 5 speculative tokens for single-user scenarios, 3 for multi-user. This speculative decoding technique can add another 30-50% throughput gain on top of the kernel fix.

Option 2: Build from Source

If you need customization or want to contribute back:

```bash
git clone https://github.com/flashinfer-ai/flashinfer
cd flashinfer
git fetch origin pull/2786/head:k64-sm120
git checkout k64-sm120
# Build with CUDA 13.2+ and the patched CUTLASS
```

Additional Optimizations That Helped

Beyond the kernel fix, several environment variables improved throughput:

  • `OMP_NUM_THREADS=6` — Avoids oversubscription with tensor parallelism=4 (not 24, which would thrash)
  • `CUDA_DEVICE_MAX_CONNECTIONS=32` — Improves kernel overlap
  • `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` — Reduces memory fragmentation
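Put together, the environment block looks like this (same values as above; set these in the shell that launches the server, or pass them as `-e` flags to `docker run`):

```shell
# Throughput-related environment tuning from the community report.
export OMP_NUM_THREADS=6                  # match CPU threads to TP=4, avoid oversubscription
export CUDA_DEVICE_MAX_CONNECTIONS=32     # more hardware queues for kernel overlap
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # reduce allocator fragmentation
```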

What This Means for the Industry

The fix highlights an uncomfortable reality: workstation Blackwell GPUs shipped with a performance ceiling that required community intervention to remove.

The Datacenter vs. Workstation Gap

NVIDIA’s documentation emphasizes Blackwell’s unified architecture. What it doesn’t emphasize is that “unified” describes the instruction set, not the memory hierarchy.

B200 datacenter GPUs get 228KB SMEM. SM120 workstation chips get 99KB. The software stack—CUTLASS, TensorRT-LLM, FlashInfer—was optimized for the datacenter configuration first.

This isn’t malicious. Datacenter deployments represent the majority of AI compute spend. But it means workstation users are effectively beta testers for a different memory regime.

The MoE Inference Revolution

MoE architectures are becoming the default for frontier models. DeepSeek-R1 (671B total, 37B active), Qwen3.5-397B (397B total, 17B active), and similar designs all rely on sparse expert routing.

These models are theoretically efficient—activating only 5-10% of parameters per token. But that efficiency assumes optimized kernels. Without them, the sparse routing overhead dominates.

NVIDIA’s own benchmarks show GB200 NVL72 achieving 2.8x throughput improvements on DeepSeek-R1 through software optimizations alone. The company is actively investing in MoE acceleration. But the SM120 fix came from the community, not NVIDIA.

Trade-offs and Limitations

The K=64 fix isn’t universally superior. Smaller tiles mean more memory transactions and slightly lower arithmetic intensity. For models that don’t hit the 99KB wall, K=128 remains optimal.

When to Use K=64

  • MoE models with NVFP4 quantization
  • SM120/SM121 GPUs (RTX PRO 6000, RTX 5090, DGX Spark)
  • Workloads where shared memory overflow forces fallback kernels
  • Scenarios where you see “Failed to initialize cutlass TMA WS grouped gemm” in logs

When K=128 Is Better

  • B200/B100 datacenter GPUs with 228KB SMEM
  • Dense models without MoE routing
  • Non-NVFP4 quantization formats (FP8, FP16)

Known Issues

The community report mentions that some custom chat templates can produce malformed closing think tags. This appears specific to template implementations, not the kernel itself.

Checklist: Is Your Workstation Crippled?

Run through this diagnostic before assuming your hardware is underperforming:

1. Check your GPU compute capability. Run `nvidia-smi --query-gpu=compute_cap --format=csv`. SM 12.0 indicates SM120 architecture with 99KB SMEM.

2. Monitor kernel launches. Enable CUDA logging and look for “Failed to initialize cutlass TMA WS grouped gemm” messages. If present, you’re hitting fallback kernels.

3. Compare against theoretical throughput. Qwen3.5-397B-A17B should exceed 200 tok/s on 4x RTX PRO 6000 with optimized kernels. Below 100 tok/s indicates a problem.

4. Verify driver version. Run `nvidia-smi | grep "Driver Version"`. Anything below 595 is missing SM120-specific fixes.

5. Check your environment. WSL2 adds overhead—native Linux should be 2x faster for the same hardware.

6. Test with the Docker image. If throughput jumps 2x+ after switching to the optimized image, you were hitting the memory wall.

Cost Analysis: What 5x Throughput Means

Let’s put the numbers in context.

At 55 tok/s, generating 1 million tokens takes about 5 hours. At 283 tok/s, it takes 59 minutes. If you’re paying for compute time—or waiting for results—this is the difference between “come back tomorrow” and “done before lunch.”
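The arithmetic above is plain division, easy to sanity-check:

```python
# Time to generate 1M tokens at the before/after throughputs from the article.
def seconds_for_tokens(tokens, tok_per_s):
    return tokens / tok_per_s

before_h = seconds_for_tokens(1_000_000, 55) / 3600   # ~5.05 hours
after_min = seconds_for_tokens(1_000_000, 283) / 60   # ~58.9 minutes
print(round(before_h, 1), round(after_min))           # 5.1 59
```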

For a business running MoE inference 24/7, the annual savings from a 5x throughput improvement on a $40,000 workstation rig could exceed $100,000 in opportunity cost and compute time.

The Bigger Picture

This isn’t just about one kernel patch. It’s about the gap between what hardware marketing promises and what software delivers.

NVIDIA sells Blackwell as a unified architecture. In practice, datacenter and workstation variants require different optimization strategies. The community fix bridges that gap for MoE workloads, but it shouldn’t have been necessary.

If you’re deploying MoE models on workstation Blackwell hardware, apply this fix. Then ask your vendor why it wasn’t included in the first place.

The good news: open-source infrastructure like FlashInfer and CUTLASS makes these fixes possible. The better news: once merged, everyone benefits.

FAQ

Q: Does this fix apply to all Blackwell GPUs?

A: No. It specifically targets SM120/SM121 workstation GPUs with 99KB shared memory. Datacenter B200/B100 GPUs with 228KB SMEM don’t need it.

Q: Will NVIDIA merge this into official CUDA/CUTLASS releases?

A: The FlashInfer PR is pending review. NVIDIA has not commented publicly on SM120 shared memory optimization, but their active MoE investment suggests eventual adoption.

Q: Can I use this with models other than Qwen3.5-397B?

A: Yes. Any MoE model using NVFP4 quantization on SM120 hardware should benefit. This includes DeepSeek-R1, future Qwen variants, and any custom MoE architecture.

Q: What’s the performance impact on non-MoE models?

A: Minimal. The K=64 tiles are specifically designed for MoE expert routing patterns. Dense models won’t see significant changes and may even see slight regression due to lower arithmetic intensity.

Q: Is this safe for production?

A: The fix has been validated on 4x RTX PRO 6000 with Qwen3.5-397B, but it’s community-maintained. Test thoroughly with your specific workload before production deployment.

Q: Do I need to recompile vLLM or TensorRT-LLM?

A: No. The Docker image includes the patched FlashInfer kernels. Just pull and run.

References

  • FlashInfer PR #2786: K=64 block-scaled MoE GEMM for SM120 — https://github.com/flashinfer-ai/flashinfer/pull/2786
  • CUTLASS Issue #3096: SM120 shared memory constraints — https://github.com/NVIDIA/cutlass/issues/3096
  • NVIDIA Developer Blog: NVFP4 Quantization for Efficient Inference — https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
  • NVIDIA Developer Blog: MoE Inference on Blackwell — https://developer.nvidia.com/blog/delivering-massive-performance-leaps-for-mixture-of-experts-inference-on-nvidia-blackwell/
  • Qwen3.5-397B Model Card — https://huggingface.co/Qwen/Qwen3.5-397B-A17B
  • Reddit r/LocalLLaMA: Original community report — https://www.reddit.com/r/LocalLLaMA/comments/1rtrdsv/55_282_toks_how_i_got_qwen35397b_running_at_speed/
  • NVIDIA CUDA Blackwell Tuning Guide — https://docs.nvidia.com/cuda/blackwell-tuning-guide/
  • Alibaba Qwen3.5 Technical Blog — https://qwen.ai/blog?id=qwen3.5