mixture-of-experts Archives

MoE Inference Costs 8.6x GPU Memory of Dense Models

June 19, 2026

In MoE inference, a model with 37B active parameters can demand roughly 8.6× the GPU memory of a dense model with equivalent per-token compute, because every expert’s weights must stay resident in VRAM even when only a fraction of the experts fire on any given token. That single number is why your DeepSeek-V3 serving footprint needs …

Editorial team