Quantization
Learn how quantization reduces LLM memory footprint by 4-8x, what INT4 and GGUF mean in practice, and how to run 70B models on consumer hardware without quality collapse.
TL;DR
- Quantization reduces model weight precision from 32-bit floats to 8-bit or 4-bit integers, cutting memory usage by 4-8x with minimal quality loss.
- A Llama 3 70B model in FP16 needs 140GB VRAM (two A100s). Quantized to INT4, it fits in ~40GB (one A100). That's the difference between a $5/hour cluster and a $2/hour single GPU.
- INT8 quantization loses less than 1% quality on standard benchmarks. INT4 loses 1-5%. Below INT4 the tradeoff usually isn't worth it.
- Three formats dominate: GGUF (CPU/local via llama.cpp and Ollama), GPTQ (GPU, layer-by-layer error minimization), AWQ (GPU, activation-aware, increasingly the default).
- QLoRA quantizes the base model to 4-bit and adds trainable LoRA adapters in 16-bit precision, making fine-tuning of 65-70B models possible on a single 48GB GPU.
- The engineering decision: quantization is the fastest path from "we can't afford to serve this model" to "it's running in production."
The problem it solves
You've found the perfect open-weights model for your use case. Llama 3 70B crushes your evaluation benchmarks, handles your domain terminology, and your team is ready to deploy. Then you check the hardware requirements.
A 70B parameter model in FP16 (the standard training precision) needs roughly 140GB just to load the weights into memory. A single NVIDIA A100 tops out at 80GB. So you need at least two A100s in a tensor-parallel configuration, which costs $5-8/hour on AWS or GCP. For a startup running 24/7 inference, that's $3,600-5,800/month, and that's before you account for KV cache memory, concurrent requests, or redundancy.
I've seen teams go through this exact math and either downgrade to a 7B model (losing 15-20% quality on their benchmarks) or decide to use a hosted API (losing control over latency, cost per token, and data privacy). Neither option is great.
The root cause is simple: each model parameter is stored as a 16-bit floating-point number. A 70B model has 70 billion of these. 70 billion x 2 bytes = 140GB. The precision of those numbers is far higher than what inference actually needs.
The bottom line: most model parameters carry far more precision than inference needs, and you're paying real money for that wasted precision every hour the model runs.
Here's the scale of the problem across common model sizes:
| Model | Parameters | FP16 Memory | INT8 Memory | INT4 Memory | GPUs needed (FP16) | GPUs needed (INT4) |
|---|---|---|---|---|---|---|
| Llama 3 8B | 8B | 16GB | 8GB | ~5GB | 1x consumer GPU | 1x laptop GPU |
| Llama 2 13B | 13B | 26GB | 13GB | ~8GB | 1x A100 | 1x consumer GPU |
| Llama 3 70B | 70B | 140GB | 70GB | ~40GB | 2x A100 | 1x A100 |
| Llama 3.1 405B | 405B | 810GB | 405GB | ~230GB | 11+ A100s | 3x A100 |
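The weight-memory columns above are plain arithmetic: parameters times bits per weight, divided by 8. Here is a minimal sketch of that calculation. The function name `weight_memory_gb` and the `group_size` / `scale_bits` parameters are my own illustration, not part of any library, and the per-group overhead model (one 16-bit scale per group) is an assumption.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, add the cost of one scale factor per group
    (assumed to be scale_bits wide) to the effective bits per weight.
    """
    effective_bits = bits_per_weight
    if group_size:
        effective_bits += scale_bits / group_size
    return n_params * effective_bits / 8 / 1e9


# Weights only: no KV cache, activations, or runtime overhead.
for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(name, round(weight_memory_gb(70e9, bits), 1), "GB")

# INT4 with one 16-bit scale per 128 weights (hypothetical setup)
print("INT4 g128", round(weight_memory_gb(70e9, 4, group_size=128), 1), "GB")
```

Raw INT4 for 70B parameters works out to 35GB. The "~40GB" figures in the table presumably also include quantization metadata and any layers kept at higher precision.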
What is it?
Quantization is the process of reducing the numerical precision of a model's weights from high-bit floating-point numbers (FP32 or FP16) to lower-bit integers (INT8 or INT4). Each parameter takes less storage, so the entire model fits in less VRAM and transfers faster through the memory bus.
Think of it like converting a high-resolution photo to a JPEG. The raw image might be 50MB, but a well-compressed JPEG at 90% quality is 5MB, and your eyes can barely tell the difference. Below 60% quality, things get noticeably worse. Quantization works the same way: there's a large compression range where quality loss is negligible, and a threshold below which things degrade fast.
A 70B model in FP16 takes 140GB. In INT8 that drops to 70GB. In INT4 it's roughly 35-40GB. That single transition (FP16 to INT4) is what makes a 70B model deployable on a single 80GB GPU (or a pair of 24GB consumer cards) instead of a multi-GPU data-center cluster. That's the jump that changed the open-source LLM ecosystem.
The tradeoff is precision. You're mapping continuous floating-point values into a small set of discrete integers. Modern quantization methods (GPTQ, AWQ, GGUF variants) minimize this mapping error carefully, but some quality is always lost. The engineering question is: how much quality can you afford to lose?
How it works
Numeric precision: the precision ladder
Every number in a model has a bit-width that determines how precisely it can represent a value. Here's the precision ladder LLMs typically move down:
| Precision | Bits per weight | Memory per 70B model | Typical use |
|---|---|---|---|
| FP32 | 32 | 280GB | Training (legacy) |
| FP16 / BF16 | 16 | 140GB | Standard training and serving |
| INT8 | 8 | 70GB | High-quality compressed serving |
| INT4 | 4 | 35-40GB | Production serving on limited hardware |
| INT2 / 1-bit | 2 / 1 | 18GB / 9GB | Research only, significant quality loss |
FP32 stores each weight as a 32-bit IEEE 754 float (1 sign bit, 8 exponent bits, 23 mantissa bits). FP16 halves that to 16 bits (5 exponent, 10 mantissa). BF16 (Brain Float 16) uses the same 16 bits but keeps FP32's 8 exponent bits and cuts the mantissa to 7, trading precision for range, which works better for the value distributions in neural networks. Serving in FP16/BF16 is effectively lossless relative to training quality because most models are already trained in those formats.
The real compression starts at INT8: you're converting floats to 8-bit integers. Below INT4, quality degrades sharply for most models, and smaller models tend to suffer more than larger ones. I've rarely seen INT2 or 1-bit work outside of research papers.
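To make the FP16 vs BF16 tradeoff concrete, here is a small sketch using only Python's standard library. `to_fp16` and `to_bf16` are hypothetical helper names. The BF16 conversion truncates the low 16 bits of the FP32 encoding, which is a simplification I'm assuming for brevity (real hardware typically rounds to nearest).

```python
import struct


def to_fp16(x):
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # Values beyond the FP16 maximum (65504) cannot be packed
        return float("inf")


def to_bf16(x):
    """Approximate BF16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: plain truncation, not round-to-nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (0.1, 70000.0):
    print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

For a small value like 0.1, FP16 lands closer to the original because it has more mantissa bits. For 70000.0, FP16 overflows while BF16 still represents it, just coarsely.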
[Figure: the full quantization pipeline, from trained model to deployed endpoint.]
Post-training quantization (PTQ)
PTQ is the most common approach: take an already-trained model and convert its weights to lower precision without any retraining. You're compressing after the fact.
The simplest version is uniform quantization: find the min and max of a weight tensor, divide that range into $2^n$ buckets (where $n$ is your target bit-width), assign each weight to its nearest bucket, and store the bucket index plus a scale factor and an offset (the zero point) for the tensor. At inference, multiply the integer by the scale factor and add the offset to recover an approximate float.
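Here is a minimal sketch of min/max uniform quantization in plain Python. The function names are mine, and I'm treating the $2^n$ buckets as $2^n$ evenly spaced levels with the tensor's min and max as the endpoints. Real implementations pack the indices into bytes and run on tensors, which this skips.

```python
def quantize_uniform(weights, n_bits):
    """Map floats to integer indices in [0, 2**n_bits - 1].

    Returns the indices plus the scale and offset needed to dequantize.
    """
    levels = 2 ** n_bits - 1  # highest index; 2**n_bits levels in total
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant tensor, where the range would be zero
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    indices = [round((w - w_min) / scale) for w in weights]
    return indices, scale, w_min


def dequantize_uniform(indices, scale, w_min):
    """Recover approximate floats from indices, scale, and offset."""
    return [i * scale + w_min for i in indices]


weights = [-0.42, -0.10, 0.0, 0.07, 0.31, 0.58]
indices, scale, w_min = quantize_uniform(weights, n_bits=4)
restored = dequantize_uniform(indices, scale, w_min)
print(indices)
print([round(r, 3) for r in restored])
```

Each restored value is off by at most half a bucket width (`scale / 2`), which is the rounding error you accept in exchange for storing 4 bits instead of 16.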
The problem with uniform quantization across an entire tensor is that weight distributions aren't uniform. Some layers have outliers 100x larger than the median. If you set your scale based on those outliers, you waste most of your integer range on values that never appear.
Modern PTQ methods fix this with per-group quantization: instead of one scale factor per tensor, compute separate scale factors for small groups of weights (typically 32 or 128). Each group gets its own min/max range, so outliers in one group don't waste precision in another.
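A quick way to see why grouping helps: quantize the same synthetic weights once with a single range and once with a range per group, then compare the reconstruction error. This is a sketch on made-up data (Gaussian weights plus one injected outlier), and the helper names are hypothetical.

```python
import random


def quant_dequant(values, n_bits):
    """Min/max uniform quantization followed immediately by dequantization."""
    levels = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    return [round((v - lo) / scale) * scale + lo for v in values]


def quant_dequant_grouped(values, n_bits, group_size):
    """Same as quant_dequant, but with a separate range for each group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quant_dequant(values[start:start + group_size], n_bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
weights[10] = 2.0  # a single outlier, roughly 100x the typical magnitude

per_tensor_err = mean_abs_error(weights, quant_dequant(weights, 4))
per_group_err = mean_abs_error(weights, quant_dequant_grouped(weights, 4, 32))
print("per-tensor:", per_tensor_err)
print("per-group: ", per_group_err)
```

With one range for the whole tensor, the outlier stretches the scale so far that nearly every normal weight collapses into the bottom one or two buckets. With groups of 32, only the outlier's own group pays that price.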
The calibration step is crucial. Calibration-based PTQ methods such as GPTQ and AWQ run a small calibration dataset (typically 128-512 samples) through the model to measure activation patterns and weight distributions. This statistical profile guides the quantization decisions. Poor calibration data leads to higher-than-expected quality loss.
Quantization-aware training (QAT)
QAT takes a different approach: simulate quantization during training itself. The model learns to produce good outputs despite the reduced precision, effectively adapting its weights to be more "quantization-friendly."
During forward passes, weights are quantized (simulated) to the target precision. During backward passes, gradients flow through the quantization step using straight-through estimators (treating quantization as an identity function for gradient purposes). The model adjusts its weights so that the quantized versions still produce good outputs.
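The forward/backward split is easier to see on a toy problem. Below is a sketch that fits a single weight to `y = 0.7 * x` while the forward pass only ever sees the weight rounded to a grid of 0.25. Everything here (the function names, the one-parameter model, the hand-written gradient) is an illustration I'm assuming for clarity. Real QAT uses a framework's autograd over full networks.

```python
def fake_quantize(w, scale):
    """Simulated quantization: round w to the nearest multiple of scale."""
    return round(w / scale) * scale


def train_with_ste(xs, ys, scale, lr=0.05, steps=200):
    """Fit y = w * x with a quantized forward pass and a straight-through backward pass."""
    w = 0.0  # latent full-precision weight that the optimizer updates
    for _ in range(steps):
        w_q = fake_quantize(w, scale)  # forward pass uses the quantized weight
        # Gradient of the mean squared error with respect to w_q
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(w_q)/d(w) as 1,
        # so the gradient is applied directly to the latent weight
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [1.0, 2.0, 3.0]
ys = [0.7 * x for x in xs]
latent_w, quantized_w = train_with_ste(xs, ys, scale=0.25)
print("latent:", latent_w, "quantized:", quantized_w)
```

Because 0.7 is not on the 0.25 grid, the quantized weight ends up on one of the two neighboring grid points (0.5 or 0.75), and the latent weight can keep oscillating around the rounding boundary between them. That oscillation is a known side effect of straight-through estimators.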
QAT consistently produces better quality at the same bit-width compared to PTQ, typically 0.5-1% better on benchmarks. The cost: you need to retrain (or continue-train) the model, which requires the original training infrastructure.