Quantization Halved Our 70B LLM Inference Cost in 2026

A 70B-parameter model in FP16 burns roughly 140 GB of VRAM just to hold its weights. Compress those weights to 4-bit integers and the footprint collapses to about 35 GB — small enough to fit on a single 80 GB GPU with room left for the KV cache. That fourfold shrink is not a parlour trick; it is the difference between a two-GPU cluster and a one-GPU node, and it is why 4-bit quantization is now the default posture for anyone serving large language models (Lyceum Technology). The only real question in 2026 is which quantization format to ship: FP8, AWQ, GPTQ, or GGUF.

The VRAM Math Makes Quantization Mandatory

During autoregressive decoding, LLM inference is memory-bandwidth bound, not compute bound. The streaming multiprocessors sit idle waiting for weights to cross the memory bus. Cut the weight bytes by 75 percent and you move more data per clock, which directly raises tokens per second — quantization buys you speed as well as capacity. A 70B model drops from ~140 GB to ~35 GB at INT4, freeing bandwidth and VRAM for concurrent KV cache (Lyceum Technology).

The practical end-to-end savings are closer to 2x than 4x once you account for activations and KV cache, but that is still the difference between overpaying for an H100 and shipping on an A100. Vendors put the gap bluntly: running an FP16 70B on an H100 when an A100 serving AWQ INT4 will do means overpaying by roughly 2x (Spheron). Given H100 spot rates have fallen from $8/hr toward $3/hr and Blackwell parts are entering production, the format choice compounds across every node (Zylos Research).

FP8 Is the Hopper Production Default

If you serve on NVIDIA H100, H200, or B200 hardware, FP8 should be your starting point. It is near-lossless — typical perplexity increase is 0.1–0.3 percent — and on Hopper it runs roughly 33 percent faster than FP16 while halving the KV-cache footprint, because the Transformer Engine handles FP8 natively (Zylos Research). Benchmarks on Mistral 7B show a 33 percent tokens/sec gain and 31 percent higher throughput against FP16, with TTFT dropping 8.5 percent.

The deployment story is trivial. In vLLM it is two flags — quantization="fp8" and kv_cache_dtype="fp8" — and you get hardware-accelerated inference with negligible quality drift (Zylos Research). For long-context workloads the FP8 KV cache is the real prize: it doubles the context you can hold per GPU. FP8 is the format you reach for when you cannot tolerate any accuracy regression and you have Hopper-class silicon. Format choice is also inseparable from serving-engine choice — the quantization kernels only matter inside the runtime that executes them, so see our H100 benchmarks comparing vLLM, TensorRT-LLM, and SGLang before you commit.

AWQ Won INT4 on Salient Weights

Below 8-bit, AWQ (Activation-aware Weight Quantization, from the MIT Han Lab) is now the expected INT4 format for production. The insight is simple and powerful: not all weights matter equally. A small slice of weight channels — roughly 1 percent — produce high-magnitude activations and disproportionately drive model outputs. AWQ finds those salient channels via a calibration pass, then protects them with per-channel scaling while aggressively compressing the rest (Spheron).

The payoff is measurable. AWQ INT4 retains about 95 percent of FP16 quality at the same bit-width, consistently beating GPTQ on quality benchmarks, and it needs no backpropagation — the calibration pass over 128–512 samples is enough (Spheron; Zylos Research). By 2026, major model families ship pre-quantized AWQ checkpoints on Hugging Face, and vLLM, SGLang, and TensorRT-LLM all ship optimized AWQ kernels. For creative writing, code generation, and large 70B+ models where nuance matters, AWQ is the right INT4 pick. It delivers 2–3x inference speedup on compatible hardware with no retraining required (Index.dev).

GPTQ Trades Quality for Raw Throughput

GPTQ is the older INT4 method — a 2022 layer-wise second-order weight optimizer — and its quality edge has eroded. It retains roughly 90 percent of FP16 quality, a few points behind AWQ, and it carries a calibration tax: quality depends heavily on the dataset you calibrate against, and overfitting the calibration set can quietly degrade reasoning (Zylos Research; Lyceum Technology).

What GPTQ still wins is throughput. The weights are not the story — the kernels are. Pair GPTQ weights with the Marlin CUDA kernels and the same model runs 2.5x faster than baseline GPTQ, often making it the highest-throughput INT4 option on NVIDIA GPUs (Zylos Research). The lesson generalises: for quantized serving, kernels matter more than the quantization algorithm. If your workload is high-volume, latency-tolerant chat or classification where a point or two of accuracy is acceptable, GPTQ-with-Marlin is a legitimate throughput play. For anything that needs to reason, prefer AWQ.

GGUF Belongs on the Edge

GGUF is a different animal: a single-file format with embedded metadata, native to llama.cpp and Ollama, designed for CPU and Apple Silicon with flexible CPU/GPU offloading. It supports a wide band of quantization levels — Q2_K through Q8_0 — and on Apple Silicon it shines because unified memory lets it hold large models without discrete VRAM (Zylos Research).

But GGUF is the wrong choice for high-concurrency GPU serving. It carries overhead in vLLM and bottlenecks under concurrent request loads, where you should stay on llama.cpp or Ollama — neither of which is a production serving tier (Lyceum Technology). The sweet spots are Q4_K_M for a good size/quality balance and Q5_K_M when you want better quality and still need speed; Q8_0 is near-lossless but defeats much of the compression. Use GGUF for local development, edge deployment, and prototyping — then move to AWQ or FP8 when you put the same model in front of real traffic.

A Decision Framework by Workload

The formats are not interchangeable; each maps to a hardware and workload niche. The table below condenses the 2026 consensus on which format fits which job.

WorkloadBest FormatWhy
GPU serving on Hopper/Blackwell, no accuracy budgetFP8~99% quality, 33% faster, native hardware support (Zylos Research)
INT4 serving where nuance matters (code, creative, 70B+)AWQ~95% quality, salient-weight protection, no retraining (Spheron)
Max throughput, tolerance for accuracy lossGPTQ + Marlin2.5x kernel speedup, highest raw req/s (Zylos Research)
Local, edge, CPU, Apple SiliconGGUF (Q4_K_M / Q5_K_M)Single-file, flexible offload, unified-memory friendly (Zylos Research)
Long-context servingFP8 (KV cache)Doubles context per GPU via FP8 KV cache (Zylos Research)

Two engineering rules fall out of this. First, kernels beat algorithms: the Marlin kernels make GPTQ competitive, and FP8 wins because Hopper executes it in silicon, not because the math is novel. Second, never trust quoted bit-widths in isolation — measure the full footprint (weights + activations + KV cache), because the real savings are roughly half the weight-compression ratio once you account for everything else in the serving stack.

References

Tags: 2083208320832083