Model distillation
Learn how knowledge distillation transfers capability from large teacher models to smaller student models, when it beats fine-tuning, and how it powers DeepSeek and Phi.
TL;DR
- Distillation trains a small "student" model to mimic a large "teacher" model's output distribution, not just its final labels. Soft probability outputs carry far more signal than hard labels.
- Temperature scaling controls how much the teacher reveals about its uncertainty. Higher temperature = softer distributions = more information for the student, up to the point where the distribution flattens toward uniform.
- Response distillation (generate with the teacher, train the student on outputs) is the production standard because it works with closed API models like GPT-4o and Claude.
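- DeepSeek's 7B/14B/32B R1 distilled variants were distilled from DeepSeek-R1 (not R1-Zero): Qwen-based models fine-tuned on reasoning traces the teacher generated. Phi-4 (14B) was trained largely on synthetic data generated with GPT-4o and outperforms many 70B models on reasoning benchmarks.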
- Distillation transfers reasoning capability; fine-tuning only adapts style and format. That distinction alone will separate you in interviews.
The problem it solves
GPT-4 reasons well. Claude Opus reasons well. Running either at high volume costs enough that the inference bill becomes the limiting factor, not the engineering. At 100K+ daily API calls, even a 5x cost reduction can mean the difference between a profitable product and one that burns cash.
The obvious next step is to use a smaller model. But small models trained from scratch on raw internet text don't inherit the reasoning patterns that make large models useful. They know facts but they can't chain logical steps the way a 70B model can.
I've watched teams try to solve this by fine-tuning a 7B model on their domain data, only to discover that fine-tuning doesn't create new reasoning ability. It adapts what the model already knows to a new format. If the underlying capability isn't there, no amount of fine-tuning will conjure it.
Here's the scale of the problem at typical production volumes:
| Use case | Daily calls | GPT-4o cost/month | Distilled 7B cost/month | Savings |
|---|---|---|---|---|
| Customer support chatbot | 50K | $2,250 | $300 | 87% |
| Code review assistant | 200K | $9,000 | $1,200 | 87% |
| Document summarization | 500K | $22,500 | $3,000 | 87% |
| Real-time content moderation | 2M | $90,000 | $12,000 | 87% |
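The table doesn't state per-call prices, but you can back them out of the first row. Here is a minimal sketch of the arithmetic, assuming a 30-day month and flat per-call pricing. The rates of $1.50 and $0.20 per 1,000 calls are implied by the table, not quoted prices from any provider.

```python
# Implied rates backed out of the table above (assumptions, not quoted prices).
DAYS_PER_MONTH = 30
TEACHER_USD_PER_1K_CALLS = 1.50   # $2,250 / (50K calls * 30 days) * 1000
STUDENT_USD_PER_1K_CALLS = 0.20   # $300 / (50K calls * 30 days) * 1000


def monthly_cost(daily_calls: int, usd_per_1k_calls: float) -> float:
    """Monthly spend in USD for a flat per-call price."""
    return round(daily_calls * DAYS_PER_MONTH / 1000 * usd_per_1k_calls, 2)


def savings_pct(teacher_cost: float, student_cost: float) -> int:
    """Percentage saved by switching to the student, rounded to a whole percent."""
    return round(100 * (1 - student_cost / teacher_cost))


for name, daily_calls in [
    ("Customer support chatbot", 50_000),
    ("Code review assistant", 200_000),
    ("Document summarization", 500_000),
    ("Real-time content moderation", 2_000_000),
]:
    teacher = monthly_cost(daily_calls, TEACHER_USD_PER_1K_CALLS)
    student = monthly_cost(daily_calls, STUDENT_USD_PER_1K_CALLS)
    print(f"{name}: ${teacher:,.0f} vs ${student:,.0f} ({savings_pct(teacher, student)}% saved)")
```

The savings column is the same on every row because both costs scale linearly with call volume: the ratio between the two per-call rates is fixed at 7.5x.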
The fundamental tension: you want large-model quality at small-model cost. Distillation is the most direct path to closing that gap.
What is it?
Knowledge distillation (Hinton et al., 2015) is a training technique where a small model (the student) is trained to match the output distribution of a large model (the teacher), rather than learning from ground-truth labels alone.
Think of it like an experienced chef teaching an apprentice. A recipe book (hard labels) says "add salt." The chef (teacher) says "add salt, but notice how the tomato sauce reacts, and if you had used cumin instead, the flavor profile would shift towards X." The apprentice learns the relationships between ingredients, not just the steps.
The teacher's soft probability output encodes these relationships. When it predicts "dog: 0.7, wolf: 0.2, cat: 0.1," the label only says "dog." The soft distribution says "this looks a lot like a wolf too, which means certain visual features matter." The student captures those inter-class relationships that hard labels discard completely.
How it works
Soft labels vs hard labels
The core insight of distillation is that hard labels are information-lossy. A one-hot vector [1, 0, 0] for "dog" throws away everything the teacher learned about how classes relate to each other.
Soft labels preserve that structure. The teacher's output probabilities [0.7, 0.2, 0.1] encode what Hinton called "dark knowledge": the probability mass assigned to incorrect classes reveals which mistakes are reasonable and which are absurd. A model that assigns 0.2 to "wolf" and 0.001 to "airplane" is telling the student something important about feature similarity.
I think of soft labels as a compressed view of what the teacher has learned about the task. Each training example carries a full distribution over classes instead of a single class index, which is far more signal per example than a hard label.
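One rough way to see the difference is to compare the Shannon entropy of the two label types from the dog/wolf/cat example. This is only a sketch: entropy is a proxy I'm using for illustration, not a measure of how much the student actually learns, and `entropy_bits` is a helper name made up here.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


hard_label = [1.0, 0.0, 0.0]   # one-hot "dog"
soft_label = [0.7, 0.2, 0.1]   # teacher output for dog, wolf, cat

print(f"hard label entropy: {entropy_bits(hard_label):.3f} bits")
print(f"soft label entropy: {entropy_bits(soft_label):.3f} bits")
```

The one-hot label has zero entropy: once you know the class, there is nothing else to read off it. The soft label has a bit more than one bit of entropy, and all of it sits in how the remaining mass is split between "wolf" and "cat".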
Temperature scaling
A model's raw outputs (logits), passed through a standard softmax, are often very peaked: one class gets probability 0.99 and everything else is near zero. That's not useful for distillation because it's almost identical to a hard label.
Temperature scaling softens the distribution by dividing logits by a temperature parameter T before applying softmax:
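import torch
import torch.nn.functional as F

logits = torch.tensor([6.2, 1.4, 0.0])  # illustrative logits, not from a real model
# Standard softmax (temperature = 1)
probs = F.softmax(logits, dim=-1)  # ~[0.99, 0.008, 0.002]
# Softened with temperature = 5
probs = F.softmax(logits / 5, dim=-1)  # ~[0.60, 0.23, 0.17]
# Higher temperature = softer distribution
probs = F.softmax(logits / 10, dim=-1)  # ~[0.46, 0.29, 0.25]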
At T=1, the distribution is sharp and most dark knowledge is hidden. At T=5-20, the distribution is smoother and inter-class relationships become visible. Hinton's original paper used T=20 for some experiments.
The sweet spot is often quoted as T=3-10 for LLMs, but treat it as a hyperparameter to tune per task. Too low and you don't get enough dark knowledge. Too high and the distribution becomes nearly uniform, washing out useful signal.
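To make the "too low vs too high" trade-off concrete, here is a dependency-free sketch that sweeps the temperature over one set of logits. The logits are hypothetical, and `softmax_with_temperature` is a helper written for this example rather than a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [6.2, 1.4, 0.0]  # hypothetical logits for dog, wolf, cat

for T in (1, 3, 5, 10, 100):
    probs = softmax_with_temperature(teacher_logits, T)
    spread = max(probs) - min(probs)  # gap between most and least likely class
    rounded = [round(p, 3) for p in probs]
    print(f"T={T:>3}: probs={rounded} spread={spread:.3f}")
```

At T=1 almost all the mass sits on one class. As T grows the gap between the most and least likely class shrinks, and by T=100 the three probabilities are nearly equal, which is the "washed out" regime.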
The loss function
The student's training loss combines two terms:
- Soft label loss: KL divergence between the teacher's softened distribution and the student's softened distribution (both at temperature T). This transfers the dark knowledge.
- Hard label loss: Standard cross-entropy between the student's output and the ground-truth label. This keeps the student grounded in factual correctness.
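# Simplified distillation loss (PyTorch)
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.7):
    # Soft label loss: KL(teacher || student) at temperature T.
    # F.kl_div takes student log-probabilities first, teacher probabilities second.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    # Hard label loss: cross-entropy on the unscaled student logits.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * T * T * soft_loss + (1 - alpha) * hard_loss

# alpha typically 0.5-0.9 (weight toward soft labels); T=4.0 and alpha=0.7 are illustrative defaults
# T^2 factor compensates for gradient magnitude reduction at high T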
The T-squared factor is a detail that matters: when you raise the temperature, the gradients from the soft loss shrink roughly as 1/T-squared. Multiplying the soft loss by T-squared restores the gradient scale so both loss terms contribute meaningfully.
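You can check the scaling numerically. The sketch below uses the closed-form gradient of KL(teacher || student) with respect to the student logits, which is (q - p) / T for softened student probabilities q and teacher probabilities p. Both sets of logits are hypothetical, and the helper names are mine.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_loss_grad(student_logits, teacher_logits, temperature):
    """Gradient of KL(teacher || student) w.r.t. the student logits: (q - p) / T."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return [(qi - pi) / temperature for qi, pi in zip(q, p)]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


teacher_logits = [6.2, 1.4, 0.0]  # hypothetical
student_logits = [2.0, 1.0, 0.5]  # hypothetical

for T in (1, 5, 10, 50, 100):
    raw = l2_norm(soft_loss_grad(student_logits, teacher_logits, T))
    print(f"T={T:>3}: |grad|={raw:.6f}  T^2*|grad|={T * T * raw:.4f}")
```

The raw gradient norm collapses as T grows, while the rescaled norm stays within the same order of magnitude. The 1/T-squared relationship is only exact in the high-temperature limit, so expect the rescaled values to drift a little at low T before settling.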
Step-by-step distillation
For reasoning tasks (math, code, logical chains), standard response distillation has a problem: the teacher's final answer is correct, but the student doesn't learn why.
Step-by-step distillation (Hsieh et al., 2023) fixes this. The teacher generates both the reasoning trace and the answer. The student trains on the full chain-of-thought, learning the intermediate steps that lead to the final output.
This is how DeepSeek built the smaller R1 variants. DeepSeek-R1 (the full teacher) generates chain-of-thought reasoning traces. The 7B/14B/32B students are fine-tuned on those traces, inheriting multi-step reasoning they couldn't learn from final answers alone.
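In data terms, the difference from plain response distillation is what goes into the student's training target. Here is a minimal sketch of building such a dataset. Everything in it is hypothetical: `fake_teacher` stands in for a real teacher call, and the "Reasoning:/Answer:" layout is one possible format, not the one DeepSeek or Hsieh et al. used.

```python
import json
from typing import Callable, Dict, List, Tuple


def build_step_by_step_dataset(
    prompts: List[str],
    teacher: Callable[[str], Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Turn teacher (reasoning, answer) pairs into supervised fine-tuning records.

    The completion holds the reasoning trace and the answer, so the student
    is trained to reproduce both.
    """
    records = []
    for prompt in prompts:
        reasoning, answer = teacher(prompt)
        completion = f"Reasoning:\n{reasoning}\n\nAnswer: {answer}"
        records.append({"prompt": prompt, "completion": completion})
    return records


def fake_teacher(prompt: str) -> Tuple[str, str]:
    """Stand-in for a real teacher model; returns a canned trace and answer."""
    return ("17 + 25 = 42, and 42 / 2 = 21.", "21")


dataset = build_step_by_step_dataset(["What is half of 17 + 25?"], fake_teacher)
for record in dataset:
    print(json.dumps(record))
```

Dropping the reasoning from `completion` turns this back into answer-only response distillation. As I understand the paper, Hsieh et al. treat rationale and label prediction as separate training tasks rather than one concatenated sequence, so take this as the simpler variant described in the text.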
The training process end-to-end
Putting it all together, a typical distillation workflow looks like this: