Reinforcement learning from human feedback
Understand how RLHF turns a capable base LLM into a safe, helpful assistant, what reward models and PPO do, and why DPO has largely replaced PPO as the practical engineering choice.
TL;DR
- Base LLMs autocomplete text well but are not safe, helpful, or aligned with human intent. RLHF is the post-training pipeline that fixes this.
- The classic pipeline has three stages: supervised fine-tuning (SFT), reward model training on human preference pairs, and PPO optimization against that reward model.
- PPO is complex, expensive, and unstable. DPO (Direct Preference Optimization) replaced it for most teams by training directly on preference pairs with a simple classification loss, no reward model needed.
- InstructGPT used PPO. Llama 2 used PPO. Most open-weight models since 2024 use DPO or its variants (KTO, ORPO).
- There is a real alignment tax: RLHF-aligned models are marginally less capable on raw benchmarks than their unaligned base counterparts, but dramatically more useful in practice.
- For your interview: know the three-stage pipeline, explain why DPO won, and be ready to discuss reward hacking.
The problem it solves
A freshly pretrained base LLM has one job: predict the next token given the previous tokens. Ask "How do I make a bomb?" and it produces instructions, not because it's malicious, but because that's what follows that sequence in training data. Ask "What year did Einstein win the Nobel Prize?" and it might confidently say 1915 (it was 1921). It doesn't follow instructions. It completes them.
You can't deploy that to users. OpenAI faced this exact problem when turning its GPT-3 family of models into InstructGPT and, later, ChatGPT. The base model was technically impressive, but the raw output was dangerous, unreliable, and often unhelpful. The gap between "predicts next token well" and "helpful, harmless, honest assistant" was enormous.
Think of it like hiring a brilliant but unfiltered intern. They know everything in the textbooks, they can write fluently, but they haven't learned workplace norms. They'll share confidential information if you ask nicely. They'll give you medical advice with complete confidence. They'll write offensive jokes if that's what the prompt implies. The knowledge is there, but the judgment isn't.
I've seen teams try to solve this with prompt engineering alone ("You are a helpful assistant. Never say anything harmful."). It works for about 80% of cases and fails catastrophically on the remaining 20%. The model needs deeper behavioral change than a system prompt can provide.
RLHF is the pipeline that closes the gap. It uses human preference signals to teach the model what "good" looks like, then optimizes the model to produce outputs humans actually prefer. It is the reason ChatGPT feels like an assistant rather than an autocomplete engine.
Here's a concrete example. Ask a base model "Summarize this legal document in plain English" and it might: (1) continue writing MORE legal text instead of summarizing, (2) produce a summary but include fabricated clauses, or (3) summarize perfectly but add unsolicited legal advice. The model doesn't know what "helpful" means in this context. RLHF teaches it by showing many examples of what humans prefer when they ask for a summary.
What is it?
RLHF (Reinforcement Learning from Human Feedback) is a post-training alignment technique that takes a capable base model and fine-tunes it to follow instructions, refuse harmful requests, and produce useful answers, shaped by human judgment rather than just next-token prediction loss.
OpenAI popularized the approach for instruction-following LLMs in the InstructGPT paper (March 2022), building on earlier work on learning from human preferences. Ouyang et al. showed that a 1.3B parameter model trained with RLHF was preferred by human raters over the raw 175B GPT-3 model. A model more than 100x smaller, aligned with human preferences, beat a vastly larger model that wasn't aligned.
That result turned alignment from a research curiosity into an engineering priority.
The analogy that lands: RLHF is like a performance review loop. The model (employee) does work. Humans (managers) compare outputs and say which is better. The model adjusts its behavior based on that feedback. Over thousands of iterations, the model converges on behavior that humans prefer, just as an employee learns what the company values through repeated feedback.
One important nuance: RLHF doesn't teach the model new knowledge. It changes the model's behavior to surface knowledge it already has in ways humans find helpful. A base model "knows" how to answer a question politely. It also "knows" how to continue toxic text. RLHF amplifies the helpful behaviors and suppresses the harmful ones. This is why alignment researchers describe RLHF as steering, not teaching.
The technique is now foundational. ChatGPT, Claude, Gemini, Llama 2, and nearly every other commercial chat model go through some version of RLHF or its descendants. Understanding it is not optional if you work with AI systems.
The transformation is dramatic. Before RLHF, you get a model that might answer "What's the capital of France?" with "France is a country in Western Europe known for its wine..." (it's continuing text, not answering a question). After RLHF, you get "The capital of France is Paris." Same model, same knowledge, completely different behavior.
How it works
The three-stage pipeline
The original InstructGPT paper established a three-stage pipeline that became the standard approach. Every major alignment effort since has been either a direct implementation of these stages or a simplification of them.
The stages are sequential and each builds on the previous one: (1) Supervised Fine-Tuning (SFT), (2) Reward Model training, (3) PPO optimization. The entire pipeline typically takes 2-4 weeks of compute for a frontier model and requires a team of human annotators.
My recommendation for understanding this: don't think of it as three separate techniques. Think of it as a single pipeline where each stage addresses a specific weakness of the previous one. SFT teaches format. The reward model captures preferences. PPO optimizes for those preferences while maintaining stability.
Stage 1: Supervised fine-tuning (SFT)
Human annotators write ideal responses to a diverse set of prompts. The base model is fine-tuned on these (prompt, ideal_response) pairs using standard cross-entropy loss. This teaches the model the basic shape of helpful conversation: follow instructions, answer questions, refuse clearly harmful requests.
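To make the "standard cross-entropy loss" concrete, here is a minimal sketch in plain Python. It assumes the model's log-probabilities for each target token are already available, and it averages the loss over response tokens only. Masking out the prompt tokens is a common convention that is assumed here, not something the description above specifies. The names sft_loss and is_response_token are hypothetical.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens.

    token_log_probs: log-probability the model assigned to each target token.
    is_response_token: True where the target token belongs to the ideal response
    (assumption: prompt tokens are excluded from the loss).
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)

# Toy example: two prompt tokens (ignored) followed by three response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(log_probs, mask)
```

The loss falls as the model assigns higher probability to the annotator's tokens, which is why SFT amounts to imitation: nothing in this objective says whether one acceptable response is better than another.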
For InstructGPT, OpenAI used about 13,000 demonstration examples from a team of 40 contractors. That's a small dataset by pretraining standards, but the quality bar was high. Each demonstration showed the model what a "good" response looked like for a specific type of prompt.
The prompts covered a wide range of tasks: open-ended generation ("Write a story about..."), summarization, Q&A, rewriting, classification, brainstorming, and code. This diversity matters because the SFT model needs to generalize across task types, not just memorize response templates for one category.
The SFT model is sometimes called the "reference model" or "reference policy" in later stages. It represents the baseline behavior that PPO will optimize from. This model is already noticeably better than the base model at following instructions, but it still produces mediocre responses for many prompts because it was only trained to imitate the annotators, not to optimize for what users actually prefer.
For your interview: SFT alone gets you maybe 70% of the way to an aligned model. The remaining 30% (handling edge cases, refusing creatively phrased harmful requests, being genuinely helpful rather than just safe) requires the reward model and policy optimization.
Stage 2: Reward model training
The SFT model generates two or more responses to the same prompt. Human raters compare the responses and pick the better one. These A-is-better-than-B preference pairs are used to train a separate reward model.
The reward model is typically another LLM (often the same architecture as the SFT model) with the final layer replaced by a scalar output head. Given a (prompt, response) pair, it outputs a single number representing how "good" the response is. It learns to assign higher scores to responses that humans preferred and lower scores to responses they rejected.
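As a rough illustration of what a "scalar output head" means, the sketch below projects one hidden-state vector to a single number with a linear layer. It assumes the score is read from the hidden state of the final token, which is one common choice rather than a detail given above. The function name and the toy vectors are hypothetical.

```python
def scalar_reward(last_hidden_state, head_weights, head_bias=0.0):
    """Project one hidden-state vector to a single reward score (linear head)."""
    if len(last_hidden_state) != len(head_weights):
        raise ValueError("hidden state and head weights must have the same size")
    return sum(h * w for h, w in zip(last_hidden_state, head_weights)) + head_bias

# Toy 3-dimensional hidden state; real models use thousands of dimensions.
hidden = [0.5, -1.0, 2.0]
weights = [1.0, 0.5, 0.25]
score = scalar_reward(hidden, weights)
```

In a real reward model the hidden state comes from the full transformer stack, and the head weights are learned together with the rest of the network.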
The training objective is a Bradley-Terry pairwise ranking loss: given a preferred response and a rejected response, maximize the probability that the reward model scores the preferred one higher. In practice, the loss function looks like:
# Simplified reward model training loss
loss = -log(sigmoid(reward(preferred) - reward(rejected)))
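A runnable version of that loss, as a minimal sketch using only the Python standard library. It assumes the two reward scores have already been computed as plain floats. The helper names (log_sigmoid, pairwise_loss, batch_loss) are illustrative, not taken from any particular library.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss for one (preferred, rejected) pair."""
    return -log_sigmoid(reward_preferred - reward_rejected)

def batch_loss(pairs):
    """Average loss over a list of (reward_preferred, reward_rejected) tuples."""
    return sum(pairwise_loss(p, r) for p, r in pairs) / len(pairs)

# The loss is small when the preferred response already scores higher,
# and large when the ranking is inverted.
well_ranked = pairwise_loss(2.0, 0.0)
mis_ranked = pairwise_loss(0.0, 2.0)
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary. It learns a ranking, not a calibrated quality score.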
For InstructGPT, OpenAI collected rankings for about 33,000 prompts, with each ranked set of responses yielding multiple comparison pairs. Llama 2 used over 1 million preference pairs across multiple rounds of collection. The quality and diversity of these pairs are among the biggest drivers of alignment quality.
Annotator agreement on these comparisons is typically 70-80%, which means 20-30% of the time, different annotators disagree on which response is better. This noise is acceptable because the reward model learns the aggregate preference signal, not any individual annotator's opinion. But it also means the reward model has a ceiling: it cannot learn preferences that humans themselves don't consistently agree on.
I've found that this is the stage most candidates skip in interviews. They'll mention RLHF as a buzzword but can't explain what the reward model actually learns or how it's trained. Knowing the Bradley-Terry loss and the data collection process signals genuine understanding.
Stage 3: PPO optimization
Now comes the reinforcement learning. A copy of the SFT model (now called the "policy") generates a response to a prompt, while the original SFT model stays frozen as the reference. The reward model scores the response. PPO (Proximal Policy Optimization) updates the policy's weights to generate higher-scoring responses.
PPO is an RL algorithm from Schulman et al. (2017) that makes small, stable updates. It clips the policy update to prevent the model from changing too much in a single step. This stability matters because language models are enormous and a single bad update can catastrophically degrade performance.
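The clipping idea can be sketched for a single sample as follows. Here ratio is the probability of the sampled action under the new policy divided by its probability under the old policy, and advantage says how much better than expected the action turned out. The default clip_eps=0.2 is an illustrative value, and the function name is hypothetical.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped objective (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective pessimistic: moving the ratio
    # outside the clip range cannot increase it any further.
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already moved far toward it: the gain is capped.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
# Bad action that the policy made more likely: the full penalty applies.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

Once the ratio leaves the range [1 - clip_eps, 1 + clip_eps] in the direction that would raise the objective, the gradient for that sample vanishes. That is what keeps each update small.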
Each PPO iteration:
- Sample a batch of prompts
- Generate completions from the current policy
- Score them with the reward model
- Compute KL divergence between the policy and the reference (SFT) model
- Compute the final reward: reward_model_score - beta * KL_divergence
- Update the policy using the clipped PPO objective
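The reward-shaping steps in the list above can be sketched as follows. This assumes the KL term is estimated from the sampled response itself, as the sum of per-token log-probability differences between the policy and the reference model. That is one common estimator, not necessarily the one a given implementation uses. The value beta=0.1 and the function names are hypothetical.

```python
def kl_estimate(policy_log_probs, reference_log_probs):
    """Sample-based KL estimate over one generated response."""
    if len(policy_log_probs) != len(reference_log_probs):
        raise ValueError("log-prob sequences must cover the same tokens")
    return sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))

def final_reward(reward_model_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model."""
    return reward_model_score - beta * kl_estimate(policy_log_probs, reference_log_probs)

# The policy assigns higher probability to its own tokens than the reference does,
# so the estimate is positive and the reward is reduced.
policy_lp = [-1.0, -0.5]
reference_lp = [-1.5, -1.0]
shaped = final_reward(2.0, policy_lp, reference_lp, beta=0.1)
```

A larger beta keeps the policy closer to the SFT baseline at the cost of slower reward improvement. A smaller beta lets the policy chase the reward model harder, which raises the risk of reward hacking.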
The process repeats for thousands of iterations. The policy gradually learns to generate responses that score well on the reward model while staying close to the SFT baseline.