Reasoning models
Learn how reasoning models like o1, o3, and DeepSeek R1 use extended chain-of-thought to dramatically outperform standard LLMs on complex tasks, and when the extra cost is justified.
TL;DR
- Reasoning models generate an extended internal thinking trace (often 500-5,000 tokens) before producing a visible answer, trading latency and cost for dramatically better accuracy on hard problems.
- o1 scored in the 89th percentile on Codeforces; GPT-4o scored in the 11th. On a 2024 qualifying exam for the International Mathematical Olympiad, o1 solved 83% of problems vs GPT-4o's 13%.
- The key insight: you can improve accuracy by spending more compute at inference time, not just by training bigger models. This is inference-time compute scaling.
- Training uses RL with outcome-based rewards on problems with verifiable answers, not just supervised fine-tuning.
- Use reasoning models for complex math, multi-step code generation, and scientific reasoning. Don't use them for simple Q&A, classification, or latency-sensitive features.
- Cost reality: 5-50x more expensive per query, 10-60 seconds time-to-first-token. Most queries don't need it.
The problem it solves
Ask GPT-4o to solve a competition math problem and it gives you a fluent, confident, wrong answer. The model has one forward pass to produce a response. It pattern-matches from training data rather than systematically working through the logic.
This isn't a fluke. On the AIME 2024 (American Invitational Mathematics Examination), GPT-4o solves about 12% of problems. On GPQA Diamond (graduate-level science questions written by PhD researchers), GPT-4o scores around 53%. The model is "smart enough" in the sense that it knows the relevant math and science. It just can't chain 5-10 reasoning steps together reliably in a single pass.
I've seen this pattern repeatedly: the model knows each individual fact needed for the answer, but it can't compose them under pressure. It's like a chess player who knows every rule but only looks one move ahead.
The same wall appears in code generation, formal logic, constraint satisfaction, and any domain where the answer depends on a chain of intermediate deductions. Direct token generation fails wherever the gap between question and answer requires systematic exploration and backtracking.
Here's a concrete example. Ask GPT-4o: "A farmer has 17 sheep. All but 9 die. How many are left?" It often answers 8 (17 minus 9). The correct answer is 9 ("all but 9" means 9 survive). The failure isn't about math, it's about parsing the problem structure. A reasoning model would decompose the sentence, identify the trick, and verify the answer against the original wording before committing.
The same pattern shows up in code generation. Ask a standard model to write a function that handles a complex data structure transformation with 5 edge cases, and it will get 3 right and miss 2. It doesn't systematically enumerate edge cases because it doesn't plan ahead. It generates code token by token, committing to an approach before it has fully analyzed the problem.
A reasoning model, by contrast, uses its thinking trace to list the edge cases first, design the approach to handle all of them, and then verify the implementation against each case before producing the final code. The output might look the same, but the process that created it is fundamentally more reliable.
The fundamental limitation: a standard LLM generates tokens left-to-right with no working memory and no ability to reconsider. Reasoning models fix this by generating a long internal trace where the model can explore, verify, and backtrack before committing to a final answer.
What is it?
A reasoning model is an LLM trained to generate an extended internal chain-of-thought before producing its final response. The thinking trace is generated autoregressively at inference time, so it's slow and expensive, but it lets the model systematically work through hard problems.
Think of it like the difference between answering a quiz question out loud versus working it out on scratch paper first. With scratch paper, you can try an approach, realize it's wrong, cross it out, and try another. Without it, you blurt out whatever comes to mind first. Reasoning models give the LLM scratch paper.
The terminology can be confusing because different providers use different names. OpenAI calls them "reasoning models" (o1, o3). DeepSeek calls theirs "R1" (for "reasoning"). Anthropic calls it "extended thinking." Google calls it "thinking mode." They all describe the same core capability: generating a long internal trace before the visible output. In this article, "reasoning model" refers to any model with this trained thinking behavior, regardless of provider.
OpenAI released o1 in September 2024 as the first widely available reasoning model. DeepSeek R1 (January 2025) followed as an open-weight model that matched o1 performance at a fraction of the cost. By mid-2025, every major provider offered reasoning capabilities: Anthropic's Claude 3.7 Sonnet extended thinking, Google's Gemini 2.5 Flash/Pro thinking mode, and OpenAI's o3 and o4-mini.
The key distinction from chain-of-thought prompting: with CoT, you write "Let's think step by step" and the model follows the pattern from pretraining. With reasoning models, the extended thinking is a trained behavior shaped by reinforcement learning. The model autonomously decides how many tokens to spend thinking, which approaches to try, and when to backtrack. That makes it qualitatively more powerful.
This is worth internalizing because it changes how you use the models. You don't need to engineer reasoning prompts for o3 or R1. In fact, you shouldn't. The model has been trained to reason. Your job is to provide a clear problem statement and let the model's RL-trained behavior handle the rest. The less you try to control the thinking process, the better it works.
How it works
Inference-time compute
The big conceptual shift is inference-time compute scaling. Traditionally, you make a model smarter by training it longer on more data with more parameters. Reasoning models introduce a second axis: spend more compute at inference time instead.
For a given training budget, letting a smaller model think for 10,000 tokens can match or beat a larger model that generates an answer in 100 tokens. This means you have two knobs to turn: model size (training compute) and thinking budget (inference compute). The optimal allocation depends on the problem difficulty.
On easy problems, thinking tokens add cost with no benefit. On hard problems, they're transformative. This is why reasoning models are a different tool, not a better version of standard LLMs.
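The "different tool" framing suggests routing queries rather than sending everything to one model. The sketch below is only an illustration: the task labels, the `Query` type, and the 10-second threshold are hypothetical, chosen to mirror the guidance and latency range quoted in this article, and a real router would need its own classifier and tuning.

```python
from dataclasses import dataclass

# Hypothetical task labels, mirroring the use cases named in this article.
REASONING_TASKS = {"complex_math", "multi_step_code", "scientific_reasoning"}


@dataclass
class Query:
    task_type: str           # label assigned by some upstream classifier (assumed)
    latency_budget_s: float  # how long the caller is willing to wait


def choose_model(query: Query, min_reasoning_latency_s: float = 10.0) -> str:
    """Return "reasoning" or "standard" for a query.

    The latency check comes first: a reasoning model cannot serve a
    request whose budget is shorter than its typical thinking time.
    """
    if query.latency_budget_s < min_reasoning_latency_s:
        return "standard"
    if query.task_type in REASONING_TASKS:
        return "reasoning"
    return "standard"
```

For example, `choose_model(Query("classification", 60.0))` returns `"standard"`, while `choose_model(Query("complex_math", 60.0))` returns `"reasoning"`.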
Here's an example that makes the tradeoff concrete. OpenAI's research showed that a smaller model (o1-mini) thinking for 10,000 tokens consistently outperformed GPT-4 (a much larger model) answering in 500 tokens on AIME math problems. The inference cost was higher per query, but the accuracy jump from 30% to 70% meant fewer retries and higher user satisfaction. The total system cost was actually lower because you weren't paying for five wrong answers before getting one right.
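The retry argument can be made concrete with a little arithmetic. The sketch below assumes that retries are independent, that each attempt succeeds with a fixed probability, and that you can tell when an answer is correct. Under those assumptions the expected number of attempts is 1/accuracy. The function names are illustrative, not from any library.

```python
def expected_cost_per_correct(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend until the first correct answer.

    Assumes independent retries with a fixed success probability, so the
    expected number of attempts is 1 / accuracy (geometric distribution).
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


def break_even_cost_ratio(standard_accuracy: float, reasoning_accuracy: float) -> float:
    """Largest per-attempt price multiple at which the reasoning model
    still costs no more per correct answer than the standard model."""
    if not 0.0 < standard_accuracy <= 1.0 or not 0.0 < reasoning_accuracy <= 1.0:
        raise ValueError("accuracies must be in (0, 1]")
    return reasoning_accuracy / standard_accuracy
```

With the 30% and 70% figures above, `break_even_cost_ratio(0.3, 0.7)` is about 2.3. Under this simple model, the reasoning model is cheaper per correct answer only while its per-attempt price stays below roughly 2.3 times the standard model's. Beyond that, the case for it rests on factors the model ignores, such as the cost of undetected wrong answers and the time users spend retrying.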
The two-axis mental model is essential for system design interviews. When someone asks "how do we handle these really hard queries?", the answer isn't always "use a bigger model." Sometimes it's "use the same model but give it more time to think." That's a fundamentally different (and often cheaper) scaling strategy.
The thinking trace
When you send a query to a reasoning model, the model first generates a thinking trace that can be 10-100x the length of the final answer. This trace typically includes:
- Problem decomposition: breaking the question into sub-problems
- Approach exploration: trying one method, checking if it leads somewhere useful
- Self-verification: "Wait, does this intermediate result make sense?"
- Backtracking: "That approach doesn't work. Let me try a different angle."
- Synthesis: combining successful sub-results into a final answer
- Confidence calibration: estimating how certain the conclusion is before committing
The trace is generated token-by-token, just like any LLM output. But it's shaped by RL training to be useful reasoning rather than just fluent text. In most APIs (OpenAI, Anthropic), the thinking trace is either hidden entirely or returned as a separate field. You pay for every thinking token at the provider's output token rate.
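Because thinking tokens are billed as output, a rough per-query cost estimate is straightforward. The sketch below is a back-of-the-envelope calculator; the prices in the example are placeholders rather than any provider's actual rates, and the function name is made up for illustration.

```python
def query_cost_usd(
    input_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of one query in USD.

    Thinking tokens are added to the visible answer tokens and billed
    at the output rate. Prices are per million tokens.
    """
    billed_output = thinking_tokens + answer_tokens
    total = input_tokens * input_price_per_mtok + billed_output * output_price_per_mtok
    return total / 1_000_000


# Placeholder prices: $1 per million input tokens, $10 per million output tokens.
with_thinking = query_cost_usd(200, 5_000, 100, 1.0, 10.0)
without_thinking = query_cost_usd(200, 0, 100, 1.0, 10.0)
```

With these placeholder numbers, a 5,000-token trace makes the query roughly 43 times more expensive than the same answer without thinking, which falls inside the 5-50x range quoted in the TL;DR.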
Here's what a thinking trace looks like in practice (simplified from a DeepSeek R1 output on a geometry problem):
<think>
The problem asks for the area of triangle ABC where A=(0,0), B=(4,0), C=(2,3).
I can use the formula: Area = 0.5 * |x_A(y_B - y_C) + x_B(y_C - y_A) + x_C(y_A - y_B)|
Substituting: 0.5 * |0(0-3) + 4(3-0) + 2(0-0)|
= 0.5 * |0 + 12 + 0|
= 6
Wait, let me verify with the base-height method.
Base AB has length 4. Height from C to AB is 3.
Area = 0.5 * 4 * 3 = 6. Confirmed.
</think>
The area of triangle ABC is 6 square units.
The thinking trace here is 8 lines. For a competition math problem, it could be 200+ lines with multiple dead ends and restarts. The final answer is one sentence. This ratio (trace length vs output length) is what makes reasoning models expensive but effective.
The <think> tags are DeepSeek R1's convention. OpenAI's o-series models use a different internal format that isn't directly visible. Anthropic's extended thinking returns the trace in a separate thinking field in the API response. The exact format varies, but the principle is identical: long internal reasoning, short external answer.
For debugging purposes, providers that expose the thinking trace (DeepSeek, Anthropic) give you a powerful tool. You can read the trace to understand why the model gave a wrong answer, which is invaluable for prompt engineering and identifying systematic failure modes.
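When the trace arrives inline in the raw text, as in the R1-style example above, you need to separate it from the answer before logging or displaying either. This is a minimal sketch that assumes a single `<think>...</think>` block in the raw output. Whether the tags actually appear in the text depends on the serving stack, and `split_thinking` is a hypothetical helper, not part of any SDK.

```python
import re
from typing import NamedTuple


class ParsedOutput(NamedTuple):
    thinking: str  # contents of the <think> block, or "" if absent
    answer: str    # everything outside the <think> block


# DOTALL lets "." match newlines, since traces span many lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> ParsedOutput:
    """Split raw model text into (thinking trace, visible answer)."""
    match = _THINK_RE.search(raw)
    if match is None:
        # No trace found: treat the whole text as the answer.
        return ParsedOutput(thinking="", answer=raw.strip())
    thinking = match.group(1).strip()
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return ParsedOutput(thinking=thinking, answer=answer)
```

Keeping the two parts separate makes it easy to store the trace for debugging while showing users only the answer.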
Training with RL
Reasoning models aren't just standard LLMs with a "think harder" instruction. The training pipeline has distinct stages:
- Supervised fine-tuning (SFT): the base model is fine-tuned on curated reasoning traces (human-written or generated by a stronger model). This teaches the format of step-by-step reasoning.
- Reinforcement learning with outcome reward: the model generates reasoning traces for problems with known correct answers. A reward signal is given based on whether the final answer is correct. The model learns which reasoning patterns lead to correct outcomes. DeepSeek R1 used GRPO (Group Relative Policy Optimization), a simpler variant of PPO that doesn't need a separate critic model.
- Distillation (optional): once a large reasoning model is trained, its reasoning traces can train smaller models. DeepSeek distilled R1 (671B parameters) into 1.5B-70B parameter models that retain much of the reasoning ability at far lower cost.
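The outcome reward in the RL stage can be as simple as checking the final answer against a reference. The sketch below assumes completions end with a line of the form `Answer: <value>` and uses exact string matching. Both the marker and the matching rule are simplifying assumptions for illustration, and production pipelines typically need more robust answer checkers.

```python
import re
from typing import Optional

# Hypothetical output convention: the final answer follows "Answer:" on one line.
_ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last "Answer:" marker, or None if absent."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Score 1.0 if the final answer matches the reference exactly, else 0.0.

    Only the outcome is scored. The reasoning steps that precede the
    answer receive no direct reward.
    """
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == reference.strip() else 0.0
```

Taking the last marker means a completion that first writes `Answer: 8`, backtracks, and ends with `Answer: 9` is scored on the 9.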
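The key innovation in DeepSeek's approach was GRPO: instead of training a separate critic/value model (as in standard PPO), GRPO estimates the baseline from the rewards of multiple completions sampled for the same prompt. Dropping the critic substantially reduces training memory and infrastructure requirements, because the critic is typically a model of comparable size to the policy. This efficiency is one reason DeepSeek's reported training costs were so low. The widely cited figure of roughly $6M is the reported compute cost of the final training run for DeepSeek-V3, the base model that R1 builds on, and it is a fraction of what OpenAI reportedly spent on o1.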
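The group-relative baseline is easy to sketch. The code below covers only the advantage computation: each completion's reward is normalized against the mean and standard deviation of its group. It leaves out the rest of the GRPO objective (the clipped policy ratio and the KL penalty). The use of population standard deviation and the `eps` term are implementation assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Compute advantages for completions sampled from the same prompt.

    The group mean replaces the learned critic as the baseline, so no
    separate value model is needed.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std; eps guards against division by zero
    return [(r - baseline) / (spread + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct completions get advantages near +1 and the incorrect ones near -1. If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.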