Chain of thought
Learn how chain-of-thought prompting makes LLMs show their reasoning steps, why it dramatically improves accuracy on complex tasks, and when to use zero-shot vs few-shot CoT.
TL;DR
- Chain-of-thought (CoT) prompting makes LLMs externalize reasoning steps before answering, boosting GSM8K accuracy from 17.9% to 58.1% on PaLM 540B.
- Zero-shot CoT is one sentence: append "Let's think step by step." Few-shot CoT provides worked examples the model imitates.
- Self-consistency samples 10-40 reasoning paths and takes majority vote, adding another 10-20 percentage points on hard problems.
- CoT only helps models above roughly 100B parameters. Smaller models see minimal or negative benefit.
- Reasoning models (o1, o3, DeepSeek-R1) perform extended CoT internally via reinforcement learning. Prompting them for explicit reasoning is redundant.
- The engineering decision: use CoT for multi-step reasoning tasks, skip it for factual retrieval, and graduate to reasoning models when CoT alone is not enough.
The problem it solves
Ask GPT-4 a direct question: "A bat and ball cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?" Without CoT, the model often answers $0.10. That's wrong. The correct answer is $0.05.
The failure is systematic, not random. Models trained to predict the next token optimize for fluency, not deliberate reasoning. On multi-step problems, they pattern-match to the surface structure and jump to an answer that "feels right." The intermediate algebra ($1.10 - $1.00 = $0.10 is twice the ball's price, not the price itself) never gets computed because the model never generates those intermediate tokens.
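Written out, the skipped algebra is short. As a sketch, let $b$ be the ball's price in dollars (the symbol is introduced here for illustration only), so the bat costs $b + 1.00$:

```latex
\begin{aligned}
b + (b + 1.00) &= 1.10 \\
2b &= 0.10 \\
b &= 0.05
\end{aligned}
```

The tempting $0.10 is the value of $2b$, which is exactly the intermediate quantity a direct answer never writes down.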
This isn't a rare edge case. On the GSM8K benchmark (grade school math, 8,500 problems), standard prompting yields under 18% accuracy on PaLM 540B. The model can do the math. It just doesn't do the math unless you tell it to show its work.
The same pattern appears in code generation, legal reasoning, medical diagnosis, and every other domain where the answer depends on a chain of intermediate conclusions. Direct prompting fails wherever the gap between question and answer requires more than one logical step.
The core insight: when you force the model to generate intermediate tokens, each step constrains what the next step can be. Errors that would compound silently in a direct answer get caught because they produce obviously wrong intermediate values. This is the same reason math teachers require "show your work."
What is it?
Chain-of-thought prompting is a technique where you instruct the model to reason through a problem step by step before giving the final answer. Wei et al. introduced it in 2022 and showed it dramatically improves accuracy on arithmetic, commonsense, and symbolic reasoning tasks without any fine-tuning or model changes.
Think of it like showing your work on a math exam. When a student writes only the answer, they might get it right or wrong and you can't tell where the reasoning broke down. When they show each step, the teacher (and the student themselves) can catch mistakes mid-derivation. CoT does the same thing for LLMs: the "showing work" tokens create a working memory the model can reference as it generates.
There are two main flavors. Zero-shot CoT appends a single phrase like "Let's think step by step" to your prompt. Few-shot CoT provides several worked examples with explicit reasoning chains, then presents the real question. Both work. Zero-shot is more convenient. Few-shot wins when you need the model to reason in a specific structure or domain.
The reason this works is mechanical, not magical. During pretraining, the model saw millions of worked solutions. CoT prompting creates the context that tells the model "we're in problem-solving mode, show the work." The reasoning tokens become a scratchpad the model can reference as it generates each subsequent token.
I think of CoT as the single highest-leverage prompting technique that exists. One sentence in your prompt can triple accuracy on math problems. Nothing else in prompt engineering comes close to that ROI.
Here's the simplest possible example. Without CoT:
Q: If a store has 3 shelves with 8 books each, and gets
a delivery of 15 books, how many books total?
A: 39
With zero-shot CoT:
Q: If a store has 3 shelves with 8 books each, and gets
a delivery of 15 books, how many books total?
Let's think step by step.
A: First, calculate books on shelves: 3 x 8 = 24 books.
Then add the delivery: 24 + 15 = 39 books.
The answer is 39.
The answer happens to be the same here, but the CoT version is verifiable. When the model gets a harder problem wrong, the reasoning trace shows you exactly which step failed. That's the engineering value: debuggability. In production, CoT reasoning traces are your audit log for model decisions.
How it works
Zero-shot CoT
Kojima et al. (2022) made a surprisingly simple discovery: appending "Let's think step by step" to a prompt activates reasoning behavior already latent in large models. No examples needed. No special formatting. Just that one phrase.
The mechanism is straightforward. The model's pretraining corpus included millions of worked solutions: math textbooks, code with comments, argumentative essays with step-by-step logic. When you say "let's think step by step," you're activating those latent reasoning patterns. The model shifts from "predict the most likely answer token" to "predict the most likely next reasoning token."
Why does this work mechanically? Autoregressive models generate one token at a time. Each new token can attend to all previously generated tokens. When the model generates reasoning tokens before the answer token, those intermediate tokens become part of the context. The answer token is now conditioned on a sequence that includes the worked steps, not just the original question. It's the difference between solving a math problem in your head versus writing it out on paper.
On the MultiArith benchmark, zero-shot CoT improved accuracy from 17.7% to 78.7% on InstructGPT (175B). On GSM8K, it went from 10.4% to 40.7% with the same model. These are massive gains from a single sentence.
The practical advice: start here. Zero-shot CoT is your baseline for any multi-step reasoning task. If it's not accurate enough, escalate to few-shot CoT.
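A minimal sketch of that baseline in Python, assuming the model ends its completion with a sentence of the form "The answer is X." The helper names `build_zero_shot_cot_prompt` and `extract_final_answer` are made up for illustration, and the completion is copied from the bookstore example above rather than produced by a real model call.

```python
import re
from typing import Optional

COT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to a question."""
    return f"Q: {question.strip()}\n{COT_TRIGGER}\nA:"


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the number in the last 'answer is X' phrase, or None if absent.

    Assumption: the model closes with that phrase, as in the bookstore example.
    """
    found = re.findall(
        r"answer is\s*\$?(-?\d+(?:\.\d+)?)", completion, flags=re.IGNORECASE
    )
    return found[-1] if found else None


prompt = build_zero_shot_cot_prompt(
    "If a store has 3 shelves with 8 books each, and gets a delivery "
    "of 15 books, how many books total?"
)

# Hypothetical completion, copied from the worked example earlier in the article.
completion = (
    "First, calculate books on shelves: 3 x 8 = 24 books.\n"
    "Then add the delivery: 24 + 15 = 39 books.\n"
    "The answer is 39."
)

print(prompt)
print(extract_final_answer(completion))
```

Keeping answer extraction separate from prompt construction means the full reasoning trace can still be logged while downstream code only sees the final value.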
Few-shot CoT
Wei et al. (2022) showed that including 3-8 worked examples with explicit reasoning chains gives the model a template to follow. The model doesn't just reason; it reasons in the format you demonstrated.
Few-shot CoT on PaLM 540B pushed GSM8K accuracy from 17.9% to 58.1%. That's a 3.2x improvement from adding examples to the prompt.
I use few-shot CoT when I need the model to reason in a specific structure. If I want tabular analysis, I show an example of tabular reasoning. If I want the model to consider edge cases, I include an example that does exactly that. The model follows the template faithfully.
One practical note: the quality of your examples matters enormously. Bad examples (sloppy reasoning, skipped steps) produce bad reasoning chains. I spend 80% of my few-shot CoT prompt engineering time on the examples and 20% on the actual question formatting.
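One way to keep the examples and the question formatting separate is to treat exemplars as data and render the prompt from them. The sketch below assumes a simple "Q: / A:" layout. The `Exemplar` class, the function name, and the apple-crate example are all invented for illustration and are not taken from Wei et al.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Exemplar:
    question: str
    reasoning: str  # the worked steps the model should imitate
    answer: str


def build_few_shot_cot_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Render worked examples first, then the real question with an open 'A:' slot."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplar only; real prompts typically use several, each checked by hand.
exemplars = [
    Exemplar(
        question="A farmer has 4 crates with 6 apples each and sells 5 apples. "
        "How many apples are left?",
        reasoning="Start with 4 x 6 = 24 apples. After the sale, 24 - 5 = 19 apples remain.",
        answer="19",
    )
]

few_shot_prompt = build_few_shot_cot_prompt(
    exemplars,
    "If a store has 3 shelves with 8 books each, and gets a delivery "
    "of 15 books, how many books total?",
)
print(few_shot_prompt)
```

Because every exemplar ends with the same "The answer is X." sentence, the model is nudged to end its own chain the same way, which makes the final answer easy to parse.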
Self-consistency
Wang et al. (2022) observed that sampling a model multiple times produces diverse reasoning paths. Wrong answers tend to be diverse (each wrong for a different reason), while correct answers cluster around the same value.
Self-consistency exploits this: sample N reasoning chains with temperature > 0, extract the final answer from each, and take the majority vote.
On GSM8K, self-consistency with 40 samples pushed PaLM 540B from 58.1% (few-shot CoT) to 74.4%. On the ARC challenge set, it improved from 85.2% to 88.7%. The pattern holds across benchmarks, but the size of the gain varies: 10-20 percentage points on hard arithmetic where single-path CoT plateaus, and only a few points where the single-path baseline is already high.
The tradeoff is pure cost. Sampling 40 times means 40x the API calls. You can often get most of the benefit with 5-10 samples. My rule of thumb: use self-consistency only for high-stakes decisions where the marginal accuracy is worth the cost, like medical triage or financial analysis. Start with 5 samples, increase only if the accuracy gain justifies the spend.
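The voting step itself is only a few lines. The sketch below assumes each chain ends with "The answer is X." and uses a canned list of chains in place of a real model call. `sample_fn` is a hypothetical hook where a temperature > 0 API call would go.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional, Tuple


def extract_answer(chain: str) -> Optional[str]:
    """Return the number in the last 'answer is X' phrase, or None if absent."""
    found = re.findall(
        r"answer is\s*\$?(-?\d+(?:\.\d+)?)", chain, flags=re.IGNORECASE
    )
    return found[-1] if found else None


def self_consistency(
    sample_fn: Callable[[], str], n_samples: int = 5
) -> Tuple[Optional[str], float]:
    """Sample n reasoning chains and majority-vote on their final answers.

    Returns (winning answer, share of all samples that agreed with it).
    """
    answers = [extract_answer(sample_fn()) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / n_samples


# Stand-in for a real sampled model call (hypothetical canned chains).
_canned = itertools.cycle(
    [
        "3 x 8 = 24. 24 + 15 = 39. The answer is 39.",
        "3 + 8 = 11. 11 + 15 = 26. The answer is 26.",
        "Shelves hold 24 books. Adding 15 gives 39. The answer is 39.",
        "8 x 15 = 120. The answer is 120.",
        "24 on shelves plus 15 delivered. The answer is 39.",
    ]
)

winner, agreement = self_consistency(lambda: next(_canned), n_samples=5)
print(winner, agreement)
```

The agreement share is worth logging alongside the answer: a narrow majority is a reasonable signal to sample more paths or route the query to a human.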
Tree of Thoughts
Yao et al. (2023) extended CoT from a single chain into a search tree. At each reasoning step, the model generates multiple candidate continuations, evaluates them, and prunes the weak branches before continuing.
The key difference from self-consistency: self-consistency samples independent paths and votes at the end. Tree of Thoughts (ToT) evaluates and prunes at each step. This makes ToT better for problems with large reasoning spaces where bad branches should be abandoned early rather than explored to completion.
ToT is powerful for problems that require exploration: creative writing with constraints, puzzle-solving, code architecture planning. On the "Game of 24" puzzle (combine four numbers using arithmetic to make 24), GPT-4 with standard prompting solves 7.3% of cases. With CoT it solves 4.0%. With ToT it solves 74%. Some problems genuinely need search, not just linear chains.
In practice, ToT is complex to implement. You need a custom loop that manages branching, evaluation scoring, and pruning. Most production systems stick with self-consistency because it gets 80% of the benefit at 20% of the complexity.
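To make the "branch, evaluate, prune" loop concrete, here is a small breadth-first sketch with a fixed beam width. It is not the implementation from Yao et al. In a real system `propose` and `score` would each be model calls, while here they are toy stand-in functions over tuples of integers so the loop can run on its own.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]


def tree_of_thoughts(
    root: State,
    propose: Callable[[State], Sequence[State]],
    score: Callable[[State], float],
    beam_width: int = 2,
    max_depth: int = 3,
) -> State:
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier: List[State] = [root]
    for _ in range(max_depth):
        # Branch: every surviving state proposes its candidate continuations.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        # Evaluate and prune: keep the strongest few, drop the rest early.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)


# Toy problem (an assumption for illustration): pick three numbers from
# {1, 3, 5} whose sum is as close as possible to 11.
TARGET = 11


def propose(state: State) -> Sequence[State]:
    return [state + (step,) for step in (1, 3, 5)]


def score(state: State) -> float:
    return -abs(TARGET - sum(state))


best = tree_of_thoughts((), propose, score, beam_width=2, max_depth=3)
print(best, sum(best))
```

The per-level pruning is what separates this from self-consistency: weak partial chains never get extended, so compute goes to the branches that still look promising.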
Why CoT works (the mechanism)
The mechanical explanation is worth understanding because it clarifies when CoT will and won't help.
LLMs are autoregressive: they generate one token at a time, and each new token can attend to all previous tokens. When you ask for a direct answer, the model must compress all reasoning into the hidden states of a single forward pass. That's a computation bottleneck. Complex reasoning requires more "serial computation" than one forward pass provides.
CoT solves this by spreading the computation across multiple generation steps. Each reasoning token is a new forward pass that can attend to all previously generated reasoning tokens. The model essentially gets more compute per problem. This is why CoT helps on hard problems (which need more computation) and doesn't help on easy ones (which fit in a single forward pass).
This also explains the 100B parameter threshold. Smaller models don't have enough latent reasoning patterns in their weights to produce useful intermediate steps. The reasoning tokens they generate are noise, not signal.
The connection to reasoning models is direct. o1, o3, and DeepSeek-R1 took this insight and trained it into the model via reinforcement learning. Instead of relying on a prompt to activate reasoning, these models learn when and how to reason on their own. CoT prompting is the manual version. Reasoning models are the automated version. The tradeoff: reasoning models cost more per token but produce more reliable reasoning on harder problems.
Key variants / types
| Variant | How it works | Best for | Tradeoff |
|---|---|---|---|
| Zero-shot CoT | Append "Let's think step by step" | Quick prototyping, general reasoning tasks | No control over reasoning style |
| Few-shot CoT | Provide worked examples with reasoning chains | Domain-specific reasoning, structured output | Requires crafting good examples |
| Self-consistency | Sample N paths, majority vote on final answer | Hard math, high-stakes decisions | N times the token cost and latency |
| Tree of Thoughts | Branch and evaluate at each reasoning step | Exploration problems, puzzles, planning | Complex implementation, high compute |
| Plan-and-Solve | "Devise a plan, then solve step by step" | Complex multi-part problems, code generation | Slightly more tokens than zero-shot CoT |
Plan-and-Solve (Wang et al., 2023) deserves a quick mention. Instead of diving straight into reasoning, you prompt the model to first outline a plan, then execute each step. It consistently outperforms zero-shot CoT by 5-8% on complex word problems because the planning step prevents the model from going down a wrong path early.
The progression from zero-shot to Plan-and-Solve reveals something fundamental: the more structure you give the model's reasoning, the better it performs. But each level of structure costs more in prompt engineering effort. Zero-shot is free. Few-shot takes 30 minutes to craft good examples. Self-consistency multiplies your API bill. Choose the level that matches your accuracy requirement and budget.
Interview tip: know the escalation ladder
Start with zero-shot CoT (free). Move to few-shot CoT (costs prompt tokens). Add self-consistency (costs N times inference). Use Tree of Thoughts only for offline or research workloads. This escalation ladder shows you understand cost-accuracy tradeoffs.
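The ladder can be read as "pick the cheapest rung that meets the accuracy target on your own evaluation set." A small sketch of that selection follows. The accuracy figures are placeholders rather than measurements, and `cheapest_sufficient` is a made-up helper name.

```python
from typing import Dict, Optional

# Rungs ordered from cheapest to most expensive, following the ladder above.
LADDER = ["zero-shot CoT", "few-shot CoT", "self-consistency", "Tree of Thoughts"]


def cheapest_sufficient(measured: Dict[str, float], target: float) -> Optional[str]:
    """Return the first rung whose measured accuracy meets the target, else None."""
    for strategy in LADDER:
        accuracy = measured.get(strategy)
        if accuracy is not None and accuracy >= target:
            return strategy
    return None


# Placeholder numbers standing in for results on your own evaluation set.
measured = {
    "zero-shot CoT": 0.71,
    "few-shot CoT": 0.83,
    "self-consistency": 0.90,
}

print(cheapest_sufficient(measured, target=0.80))
print(cheapest_sufficient(measured, target=0.95))
```

A `None` result means no measured rung is good enough, which is the point at which a reasoning model or fine-tuning enters the conversation.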
When to use / when to avoid
So when does CoT actually help? The line is cleaner than most prompt engineering decisions. The rule of thumb: if the answer requires more than one logical step, try CoT. If it doesn't, skip it.
When to use CoT
- Multi-step math and logic: arithmetic word problems, constraint satisfaction, scheduling. CoT is the single biggest accuracy lever for these tasks.
- Code generation and debugging: "Trace through what each line does before writing the fix" catches subtle off-by-one errors that direct prompting misses.
- Multi-hop reasoning: questions that require combining facts from different parts of the context. CoT forces the model to connect the dots explicitly.
- Auditability requirements: when you need to show why the model reached a conclusion (medical, legal, financial), the reasoning trace is the audit trail.
- Plan generation: any task where the model needs to outline a plan before executing. "First, identify the components. Then, define their relationships. Finally, determine the data flow."
- Complex extraction: pulling structured data from unstructured text where the model needs to identify entities, resolve references, and infer relationships.
When to avoid CoT
- Simple factual retrieval: "What is the capital of France?" CoT adds latency and occasionally confuses the model into second-guessing itself.
- Classification tasks: sentiment analysis, intent detection. These are single-step pattern matches. CoT adds cost without accuracy gains.
- Small models (under 100B params): Wei et al. found that CoT provides minimal benefit on models smaller than roughly 100B parameters. The reasoning patterns aren't reliably encoded in smaller models.
- Latency-critical paths: each reasoning token is a serial generation step. CoT adds 1-5 seconds to typical prompts. If your SLA is 200ms, CoT is not an option.
- Already using reasoning models: o1, o3, and DeepSeek-R1 perform extended CoT internally. Prompting them for explicit reasoning is redundant and wasteful.
- High-throughput batch processing: when you're processing millions of queries and the task is simple enough, the cumulative token overhead of CoT makes it cost-prohibitive.
The model size cliff
CoT benefits don't degrade gracefully with smaller models. They fall off a cliff. A 70B model might see modest improvements. A 7B model often performs worse with CoT than without. Always test on your actual model before committing to a CoT strategy.
For your interview: the key signal is knowing when NOT to use CoT. Candidates who say "I'd add CoT to everything" reveal they don't understand the cost-benefit tradeoff.
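The use and avoid lists above translate fairly directly into a routing check. The sketch below is one possible encoding, not a standard API: the task labels, the `Request` fields, and the 1000 ms cutoff are assumptions chosen to mirror the bullets.

```python
from dataclasses import dataclass

# Task labels are hypothetical; they mirror the "when to use" bullets above.
COT_TASKS = {"math", "code_debugging", "multi_hop", "planning", "complex_extraction"}


@dataclass
class Request:
    task_type: str
    latency_budget_ms: int
    uses_reasoning_model: bool = False


def should_use_cot(req: Request, min_latency_budget_ms: int = 1000) -> bool:
    """Decide whether to add a CoT instruction to this request's prompt."""
    if req.uses_reasoning_model:
        return False  # the model already reasons internally
    if req.latency_budget_ms < min_latency_budget_ms:
        return False  # CoT typically adds seconds, so tight SLAs rule it out
    # Anything not listed (factual lookup, classification, ...) goes direct.
    return req.task_type in COT_TASKS


print(should_use_cot(Request("math", latency_budget_ms=5000)))
print(should_use_cot(Request("classification", latency_budget_ms=5000)))
print(should_use_cot(Request("math", latency_budget_ms=200)))
```

Model size is deliberately left out of the check, since the advice above is to test CoT on the actual model rather than trust a fixed parameter cutoff.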
Real-world examples
Google PaLM on GSM8K. Standard prompting: 17.9% accuracy. Few-shot CoT: 58.1%. Self-consistency (40 paths): 74.4%. The first two numbers come from Wei et al. (2022) and the self-consistency result from Wang et al. (2022); together they are the canonical demonstration. The model went from failing grade-school math to passing it, just by changing the prompt. This is the result I'd cite in any interview where CoT comes up.
Code debugging with reasoning traces. A developer tools company added CoT to their AI code review pipeline: "Trace through the function line by line. What does each variable hold after each statement? Now identify the bug." This approach caught 73% of off-by-one and null-reference bugs that direct "find the bug" prompting missed. The reasoning trace also became the explanation shown to developers, so they could verify the AI's logic before accepting the fix.
The pattern here is important: CoT doesn't just improve accuracy, it produces the explanation as a free byproduct.
Medical symptom triage. A healthcare startup used few-shot CoT for classifying patient symptoms into urgency tiers. Direct classification misrouted 8% of critical cases. With CoT (asking the model to reason through symptom severity, patient history, and red-flag indicators step by step), misclassification of critical cases dropped to 2%. That 6 percentage point difference meant hundreds of patients per month got routed to the right care level. The reasoning trace also served as the clinical rationale shown to nurses for review.
Customer support escalation. A SaaS company applied self-consistency (5 paths) to their intent classification pipeline for support tickets. Single-path accuracy was 82%. Self-consistency pushed it to 91%, cutting wrong-team routing in half (from 18% to 9% of tickets). The 5x cost increase was worth it because each misrouted ticket cost $15 in wasted agent time.
SQL query generation. A data platform added CoT to their natural-language-to-SQL pipeline. The prompt asks the model to first identify the relevant tables, then list the join conditions, then construct the query step by step. This reduced syntax errors from 22% to 6% and semantic errors (valid SQL that answers the wrong question) from 18% to 7%. The reasoning trace also became the explanation shown to analysts, who could verify the logic before running the query.
The benchmark landscape
Key results to remember: GSM8K 17.9% → 58.1% → 74.4% (standard → CoT → self-consistency, PaLM 540B). MultiArith 17.7% → 78.7% (zero-shot CoT, InstructGPT 175B). Game of 24: 7.3% → 74% (standard → Tree of Thoughts, GPT-4). These numbers tell the story of what CoT can do.
Limitations and tradeoffs
Token cost scales linearly. A reasoning chain adds 100-500 tokens per query. Self-consistency with 20 samples multiplies that by 20. At scale, this is real money. A system processing 1M queries/day with 300-token reasoning chains at $0.01/1K tokens adds $3,000/day just for the thinking. With self-consistency at 20 samples, that becomes $60,000/day.
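The arithmetic behind those figures, as a small sketch. The function name is made up, the price is the illustrative $0.01 per 1K tokens from the paragraph above, and only the added reasoning tokens are counted.

```python
def daily_cot_cost(
    queries_per_day: int,
    reasoning_tokens: int,
    price_per_1k_tokens: float,
    samples: int = 1,
) -> float:
    """Extra daily spend in dollars attributable to reasoning tokens alone."""
    total_tokens = queries_per_day * reasoning_tokens * samples
    return total_tokens / 1000 * price_per_1k_tokens


single_chain = daily_cot_cost(1_000_000, 300, 0.01)
with_voting = daily_cot_cost(1_000_000, 300, 0.01, samples=20)

print(f"single chain: ${single_chain:,.0f}/day")
print(f"20-sample self-consistency: ${with_voting:,.0f}/day")
```

Plugging in your own traffic, chain length, and provider pricing gives a quick estimate before committing to a sampling budget.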
Latency is inherently serial. Each reasoning token depends on the previous one. You cannot parallelize CoT generation within a single query. Expect 1-5 seconds of added latency per reasoning chain. For user-facing applications, this may push you past acceptable response times.
Small models don't benefit. The 100B parameter threshold from Wei et al. is approximate but directional. Models under this size lack the latent reasoning patterns that CoT activates. On smaller models, CoT sometimes degrades performance by forcing the model to generate confident-sounding but wrong intermediate steps. If you're using a 7B or 13B model, invest in fine-tuning on reasoning chains rather than prompting for them.
Hallucinated reasoning. CoT doesn't guarantee correct reasoning. The model can produce a chain that looks logical but contains a subtle error (wrong arithmetic, false premise). The reasoning trace helps you catch these errors, but it also gives the model a way to construct elaborate justifications for wrong answers. I've seen models produce five perfectly logical-looking steps that arrive at a completely wrong answer, and the user trusted it because the steps "made sense."
Reasoning models make explicit CoT redundant. o1, o3, and DeepSeek-R1 are trained via reinforcement learning to produce internal chains of thought. Their "thinking tokens" are generated and billed but often hidden from the output. Prompting these models with "let's think step by step" is paying twice for the same capability.
Watch for confident wrong reasoning
The most dangerous failure mode is when CoT produces a plausible-looking chain that arrives at a wrong answer. The chain gives the user false confidence. Always validate CoT outputs on critical decisions, either with self-consistency or external verification.
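For arithmetic chains, one cheap form of external verification is to recompute every "a op b = c" step found in the trace. The sketch below is a narrow check with a hypothetical helper name. It catches wrong arithmetic only, not false premises or steps that are correct but irrelevant.

```python
import re
from typing import List

NUMBER = r"-?\d+(?:\.\d+)?"
# Matches steps such as "3 x 8 = 24" or "24 + 15 = 39".
STEP = re.compile(rf"({NUMBER})\s*([+\-x*/])\s*({NUMBER})\s*=\s*({NUMBER})")

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "x": lambda a, b: a * b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_bad_steps(trace: str, tol: float = 1e-9) -> List[str]:
    """Return every 'a op b = c' step whose claimed result is wrong."""
    bad = []
    for match in STEP.finditer(trace):
        a, op, b, claimed = match.groups()
        if op == "/" and float(b) == 0:
            bad.append(match.group(0))  # division by zero is never a valid step
            continue
        actual = OPS[op](float(a), float(b))
        if abs(actual - float(claimed)) > tol:
            bad.append(match.group(0))
    return bad


good_trace = "First, 3 x 8 = 24 books. Then 24 + 15 = 39 books. The answer is 39."
bad_trace = "First, 3 x 8 = 26 books. Then 26 + 15 = 41 books. The answer is 41."

print(find_bad_steps(good_trace))
print(find_bad_steps(bad_trace))
```

In the second trace only the first step is flagged: the later addition is internally consistent, which is exactly how a single early slip yields a confident wrong answer.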
The fundamental tension: CoT trades tokens and latency for accuracy. The right choice depends on where your task sits on the "simple lookup vs. complex reasoning" spectrum. Most teams over-apply CoT to tasks that don't need it, wasting money without improving results. If you're unsure, run a quick A/B test: 50 queries with CoT, 50 without. If accuracy doesn't change, skip it.
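A sketch of how that quick A/B test could be scored. The graded results are placeholders, not real data, and the standard-error estimate is an addition for illustration: with only 50 queries per arm, a gap of a few points is within noise.

```python
import math
from typing import Sequence, Tuple


def compare_arms(
    direct_correct: Sequence[bool], cot_correct: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (direct accuracy, CoT accuracy, standard error of the difference)."""
    n_direct, n_cot = len(direct_correct), len(cot_correct)
    p_direct = sum(direct_correct) / n_direct
    p_cot = sum(cot_correct) / n_cot
    std_err = math.sqrt(
        p_direct * (1 - p_direct) / n_direct + p_cot * (1 - p_cot) / n_cot
    )
    return p_direct, p_cot, std_err


# Placeholder grading results: 50 queries per arm, True means answered correctly.
direct = [True] * 30 + [False] * 20
cot = [True] * 42 + [False] * 8

p_direct, p_cot, std_err = compare_arms(direct, cot)
gain = p_cot - p_direct
print(f"direct={p_direct:.2f} cot={p_cot:.2f} gain={gain:.2f} (+/- {2 * std_err:.2f})")
```

If the gain is smaller than roughly two standard errors, treat the result as "no measurable change" and skip CoT for that task, or rerun with more queries.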
How this shows up in interviews
When to bring it up
Mention CoT whenever the interviewer's problem involves multi-step reasoning in an LLM pipeline: building a math tutor, a code review assistant, a medical triage system, or any application where the model needs to "think" rather than retrieve. Also mention it when someone proposes using a reasoning model for a simple task; knowing that basic CoT is often sufficient (and much cheaper) is a strong signal.
Depth calibration
- Junior: knows zero-shot CoT exists, can explain why "let's think step by step" helps.
- Senior: can compare zero-shot vs few-shot CoT, explains self-consistency, understands the 100B parameter threshold, knows when CoT wastes tokens.
- Staff: discusses Tree of Thoughts, Plan-and-Solve, connects CoT to reasoning models as "learned CoT," articulates the cost-accuracy tradeoff curve and can sketch a system that adaptively applies CoT only when needed. Knows the Game of 24 and GSM8K benchmark numbers.
Interview Q&A
| Interviewer asks | Strong answer |
|---|---|
| "How would you improve accuracy on this reasoning task?" | "Add CoT prompting first (free). If that's insufficient, try few-shot CoT with domain examples. For critical paths, add self-consistency with 5-10 samples." |
| "Why not just use CoT everywhere?" | "It adds 100-500 tokens per query. On factual lookups and classification, it increases cost with zero accuracy gain. Apply it selectively to multi-step tasks." |
| "What's the difference between CoT and reasoning models?" | "CoT is prompt-level. Reasoning models like o1 perform extended CoT internally via RL training. They're more capable on hard problems but cost 5-10x more per query." |
| "How does self-consistency work?" | "Sample N reasoning paths with temperature > 0, extract final answers, take majority vote. Wrong answers are diverse; correct ones cluster. It adds N times cost but 10-20 percentage points of accuracy." |
| "When does CoT fail?" | "Small models (under 100B params), simple factual queries, and latency-critical paths. Also watch for hallucinated reasoning where the chain looks valid but contains subtle errors." |
Common interview mistakes
| Mistake | Why it's wrong | What to say instead |
|---|---|---|
| "I'd use CoT for all LLM prompts" | CoT wastes tokens on simple tasks and can degrade performance on classification | "I'd apply CoT selectively to multi-step reasoning tasks and skip it for factual retrieval" |
| "CoT works on any model" | Models under ~100B parameters lack the latent reasoning patterns CoT activates | "CoT benefits scale with model size. Below 100B params, the gains are minimal or negative" |
| "Self-consistency is just resampling" | It specifically exploits the clustering property of correct answers | "Self-consistency works because wrong answers are diverse while correct answers converge" |
| Confusing CoT prompting with reasoning models | Reasoning models do trained CoT internally via RL, not via prompt instructions | "Reasoning models internalized CoT through RL. Prompting them to think step by step is redundant" |
| "More reasoning steps is always better" | Long chains can drift or introduce errors; the model may hallucinate reasoning | "The chain should be as long as the problem requires and no longer. Over-reasoning adds error surface" |
Quick recap
- Chain-of-thought prompting makes LLMs externalize intermediate reasoning steps, boosting accuracy on multi-step tasks by 2-4x with zero model changes.
- Zero-shot CoT ("Let's think step by step") is the simplest and cheapest entry point, effective on models above roughly 100B parameters.
- Few-shot CoT provides worked examples that control the reasoning format, achieving 58.1% on GSM8K versus 17.9% with standard prompting (PaLM 540B).
- Self-consistency samples multiple reasoning paths and takes majority vote, adding 10-20 percentage points for an N-times cost multiplier.
- CoT wastes tokens on simple factual retrieval, classification, and small models. Apply it selectively to multi-step reasoning tasks.
- Reasoning models (o1, o3, DeepSeek-R1) perform extended CoT internally via RL training, making prompt-level CoT redundant when using them.
- The engineering decision: classify your queries, apply CoT only where reasoning is needed, and graduate to reasoning models for the hardest problems.
- The escalation ladder: zero-shot CoT (free) → few-shot CoT (prompt tokens) → self-consistency (N times cost) → reasoning models (5-10x per token).
Related concepts
- Few-shot prompting - The foundation CoT builds on. Few-shot CoT combines example-based prompting with explicit reasoning chains.
- Reasoning models - The next evolution. Models like o1 internalize CoT through reinforcement learning, producing better reasoning at higher per-token cost.
- Context engineering - CoT is one tool in the broader context engineering toolkit. Understanding how to structure the entire prompt window matters.
- Large language models - CoT only works on large models. Understanding why requires knowing how scale affects emergent capabilities.