Fine-tuning
Learn when fine-tuning outperforms prompting, how LoRA makes it affordable, and how to decide between full fine-tuning, LoRA, QLoRA, and instruction tuning for your use case.
TL;DR
- Fine-tuning continues training a pretrained model on your own dataset, nudging its weights toward your specific task, format, or domain vocabulary.
- Try RAG and good prompting first. Fine-tune only when you need consistent format, domain vocabulary, or cost/latency reduction via a smaller specialized model.
- LoRA trains under 0.1% of parameters and matches full fine-tuning quality on most tasks. QLoRA fits 70B model training on a single 40GB A100.
- Data quality is the primary lever. 1K curated examples routinely outperform 100K noisy ones.
- A held-out evaluation set is non-negotiable. Without one, you cannot tell whether fine-tuning helped or just changed the failure mode.
- The economics drive adoption: a fine-tuned 7B model can match GPT-4 on your task at 100x lower cost and 10x lower latency.
The problem it solves
General-purpose models know everything about everything, which means they know nothing specifically about your product. Ask GPT-4 to draft a support reply in your company's voice and you get something competent but generic. It does not know your tone, your internal terminology, or the specific way you acknowledge customer frustration before pivoting to solutions.
You could put all of that context into the system prompt. Sometimes that works. But prompts get long, long prompts cost money at inference, and the model treats every instruction as a temporary hint rather than internalized behavior.
The other driver is cost and latency. A fine-tuned 7B model running on your own hardware can match GPT-4 on your specific task at 100x lower inference cost and 10x lower latency. I've watched teams run the math and realize that a $500 training run pays for itself in the first week of production traffic.
Here is the cost breakdown that makes the case:
| Factor | GPT-4 API | Fine-Tuned 7B (Self-Hosted) |
|---|---|---|
| Per-query cost | ~$0.01-0.03 | ~$0.0003 |
| Latency (p50) | 800-2000ms | 50-200ms |
| Monthly cost at 1M queries | $10K-30K | $500-2K (GPU rental) |
| Data privacy | Sent to third party | Stays on your infra |
| Customization | Prompt only | Weights + prompt |
That math is hard to argue with. When you need a model that feels like it belongs to your product rather than borrowing time from a general assistant, fine-tuning is the answer.
What is it?
Fine-tuning means continuing the training process on a pretrained base model using your own dataset. The model's weights, which already encode language understanding from pretraining on trillions of tokens, get nudged toward the specific behavior you want.
Think of it like hiring an experienced chef and teaching them your restaurant's recipes. You do not teach them how to cook from scratch. They already know knife skills, heat control, and flavor balance. You just show them your specific dishes, your plating style, and your secret sauce. That is fine-tuning: specialization on top of deep general competence.
There are several distinct approaches under the fine-tuning umbrella: full fine-tuning, LoRA, QLoRA, and instruction fine-tuning. Choosing between them is mostly a hardware and cost decision, not a quality decision. For your interview: say "fine-tuning adapts a pretrained model's weights to a specific task using domain data" and you have the definition nailed.
How it works
Full fine-tuning
Full fine-tuning updates every parameter in the model using standard gradient descent on your dataset. For a 7B parameter model, you are touching 7 billion floats. In fp16 precision, that means roughly 14GB just for the weights, plus optimizer states (Adam stores two additional copies), pushing total VRAM to 80GB or more.
This approach has the highest quality ceiling because every weight can shift to fit your task. It also carries the highest risk of catastrophic forgetting, where the model loses general capabilities while over-specializing on narrow data.
The training loop is conceptually simple: take a batch of (input, expected_output) pairs, compute the loss between the model's prediction and the expected output, then backpropagate gradients through every layer. The entire parameter space shifts with each step. This is the same process as pretraining, just on your smaller dataset with a lower learning rate (typically 1e-5 to 5e-5).
I've seen teams default to full fine-tuning because it feels "complete." In practice, the cost and forgetting risk make it the wrong choice for most use cases. Reserve it for when you have abundant data, large GPU budgets, and a base model under 13B parameters.
LoRA (Low-Rank Adaptation)
LoRA freezes all base model weights and inserts small trainable matrices into each attention layer. Instead of updating a weight matrix W directly, LoRA adds a low-rank decomposition: delta_W = A x B, where A and B are much smaller matrices. The rank hyperparameter (typically 4 to 64) controls the capacity of these adapters.
The key insight is that the weight updates needed for task adaptation live in a low-dimensional subspace. A full-rank update is wasteful. LoRA typically trains under 0.1% of total parameters, produces results competitive with full fine-tuning on most NLP tasks, and costs roughly 10x less to run.
LoRA adapters are small files (tens of megabytes) that you merge into the base model at serving time or swap between tasks. This makes multi-tenant fine-tuning practical: one base model, many adapters.
# Simplified QLoRA config (using PEFT + bitsandbytes)
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16, # Rank: 4-64, higher = more capacity
lora_alpha=32, # Scaling factor, typically 2x rank
target_modules=["q_proj", "v_proj"], # Which layers get adapters
lora_dropout=0.05,
task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
print(f"Trainable params: {model.num_parameters(only_trainable=True):,}")
# Output: Trainable params: 4,194,304 (vs 7B total)
QLoRA (Quantized LoRA)
QLoRA combines 4-bit NF4 quantization of the base model with LoRA adapters trained in fp16. The base weights are compressed to 4-bit precision (cutting VRAM by roughly 4x), while the adapter computations remain in full precision to preserve gradient quality.
This fits a 70B model onto a single 40GB A100 GPU for training. QLoRA is the default for on-prem teams and anyone who does not have a multi-GPU cluster. I recommend QLoRA as the starting point for any fine-tuning project unless you have a specific reason to do otherwise.
The quality trade-off is smaller than you would expect. The QLoRA paper showed results within 1-2% of full fine-tuning on most benchmarks, while using a fraction of the compute.
Comparing the approaches
Interview tip: know the cost ladder
In interviews, frame this as a cost-quality ladder. Start with QLoRA (cheapest, good quality). If quality is insufficient after data curation, scale to LoRA with a larger base model. Full fine-tuning is the last resort, not the first.
Instruction fine-tuning
Instruction fine-tuning trains a base model on (instruction, response) pairs so it learns to follow directions. This is how raw GPT-3 became InstructGPT, and how Llama 2 Base became Llama 2 Chat. Without instruction tuning, base models autocomplete text rather than answering questions.
The data format is critical. For single-turn tasks, use prompt-completion pairs. For chat models, use the multi-turn format with system/user/assistant roles matching the model's expected chat template. Mismatched formats are the most common source of poor results that do not obviously look like overfitting.
For your interview: if someone asks "how does a base model become a chat model," the answer is instruction fine-tuning plus RLHF. Say that in one sentence and move on.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.