MoE Inference Costs 8.6× the GPU Memory of Dense Models

In Mixture-of-Experts (MoE) inference, a model with 37B active parameters can demand roughly 8.6× the GPU memory of a dense model with equivalent per-token compute, because every expert’s weights must stay resident in VRAM even though only a fraction of them fire on any given token. That single number is why your DeepSeek-V3 serving footprint needs …
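The residency argument can be made concrete with a weights-only estimate. The sketch below is a minimal illustration, not the accounting behind the 8.6× figure: the function name `moe_weight_footprint` and the layer sizes are hypothetical and do not describe any real model's configuration. It also ignores KV cache, activations, and framework overhead.

```python
def moe_weight_footprint(shared_params, params_per_expert, num_experts,
                         top_k, bytes_per_param=2):
    """Return (resident_bytes, active_bytes) for weights only.

    resident: every expert is held in VRAM, whether or not it is routed to.
    active:   only the top_k routed experts contribute to a token's compute,
              which is what a compute-equivalent dense model would store.
    """
    resident = (shared_params + num_experts * params_per_expert) * bytes_per_param
    active = (shared_params + top_k * params_per_expert) * bytes_per_param
    return resident, active


# Hypothetical layout (assumed for illustration only): 5B shared parameters,
# 64 experts of 2B parameters each, 8 experts routed per token, 16-bit weights.
resident, active = moe_weight_footprint(
    shared_params=5_000_000_000,
    params_per_expert=2_000_000_000,
    num_experts=64,
    top_k=8,
)
ratio = resident / active

print(f"resident weights: {resident / 2**30:.0f} GiB")
print(f"active weights:   {active / 2**30:.0f} GiB")
print(f"memory ratio:     {ratio:.1f}x")
```

The ratio depends only on how many parameters sit in experts and how few of them are routed per token. Changing `bytes_per_param` (for example, by quantizing) shrinks both footprints but leaves the ratio unchanged.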