Tree-of-thought reasoning
Explore multiple reasoning branches simultaneously, evaluate each intermediate step, and backtrack from dead ends (unlike linear chain-of-thought).
TL;DR
- Tree-of-Thought (ToT) turns LLM reasoning into a search problem: generate multiple candidate reasoning paths, evaluate intermediate states, prune bad branches, and backtrack from dead ends.
- An evaluator (the same LLM or a separate one) scores each intermediate thought step, letting the system focus compute on the most promising directions.
- On the Game of 24 benchmark, ToT solved 74% of problems compared to chain-of-thought's 4%.
- Search strategy matters: BFS explores breadth for problems with many viable paths, DFS goes deep when early commitment is safe.
- The cost is real: 10-50x more tokens than linear CoT. Reserve this for high-value reasoning tasks where correctness matters more than speed.
The Problem It Solves
Your coding agent is debugging a complex test failure. It reads the stack trace, forms a hypothesis ("the null pointer comes from the database layer"), and starts tracing that path. Three tool calls later, it hits a dead end. The database layer is fine. But the agent has already committed 2,000 tokens to this reasoning path and has no mechanism to back up and try a different hypothesis.
This is the fundamental limitation of linear chain-of-thought. CoT produces one reasoning chain, step by step, left to right. If step 3 is wrong, steps 4 through 10 are wasted compute built on a bad foundation. The model can't look sideways at alternative approaches or rewind to an earlier branching point.
I've watched agents burn through entire context windows chasing a single flawed hypothesis. The fix was always the same: the agent needed to explore multiple paths simultaneously and cut the losers early.
Tree-of-Thought, introduced by Yao et al. (2023) and independently by Long (2023), addresses this by turning reasoning into a search problem. Instead of producing one chain of thoughts, the agent explores a tree of possible reasoning paths, evaluates intermediate states, prunes unpromising branches, and backtracks when necessary. The result is dramatically higher accuracy on hard problems at the cost of higher token usage.
What Is It?
Tree-of-Thought reasoning treats each intermediate reasoning step as a node in a search tree. Instead of producing one linear chain, the LLM generates multiple candidate "thoughts" at each step, an evaluator scores them, and a search algorithm (BFS or DFS) decides which branches to expand next and which to abandon.
Think of it like a chess grandmaster analyzing a position. A novice sees one move and plays it. The grandmaster mentally explores 3-4 candidate moves, evaluates the resulting positions several moves deep, prunes the lines that lead to trouble, and only then commits to the best path. ToT gives an LLM this same explore-evaluate-prune capability.
The tree structure makes three things possible that linear CoT cannot do: exploring multiple hypotheses in parallel, evaluating partial progress before committing, and backtracking when a branch fails.
How ToT Compares to Other Reasoning Strategies
Before diving into the mechanism, it helps to see where ToT sits in the landscape of LLM reasoning approaches. Each strategy makes a different tradeoff between compute budget and reasoning quality.
| Strategy | Paths Explored | Evaluation Point | Backtracking | Token Cost |
|---|---|---|---|---|
| Standard prompting | 1 | Final answer only | None | 1x |
| Chain-of-Thought | 1 | Final answer only | None | 1.5-2x |
| Self-consistency | k (independent) | Final answers compared | None | kx |
| Tree-of-Thought | k^d (pruned) | Every intermediate step | Explicit | 10-50x |
| LATS | k^d (pruned + feedback) | Environment + intermediate | Explicit + learning | 20-100x |
CoT and self-consistency are points on the same spectrum. ToT pushes further by evaluating intermediate progress and pruning early. LATS goes even further by adding real environment feedback.
The key takeaway: each strategy to the right in this table pays more tokens for better reasoning quality. The right choice depends on how hard your task is and how much correctness is worth.
How It Works
Step 1: Thought Decomposition
The first design decision is how to decompose the problem into "thought steps." Each thought is an intermediate reasoning unit small enough to evaluate independently but large enough to represent meaningful progress.
For the Game of 24 (combine four numbers using arithmetic to make 24), a thought is one arithmetic operation: "8 * 3 = 24, remaining: [24, 1, 1]." For creative writing, a thought might be a full paragraph. For code debugging, a thought is one hypothesis about the root cause plus the evidence that supports or refutes it.
The granularity matters enormously. Too fine-grained and the tree explodes with branching factor. Too coarse and evaluation becomes unreliable because each node bundles too many decisions together. I've found that 3-5 thoughts per step with 3-4 steps deep is the sweet spot for most reasoning tasks.
The right decomposition depends entirely on the task. Here's a guide:
| Task Type | Thought Granularity | Example Thought |
|---|---|---|
| Arithmetic (Game of 24) | One operation | "8 * 3 = 24, remaining: [24, 2, 1]" |
| Code debugging | One hypothesis + test | "Hypothesis: NPE from line 42. Evidence: stack trace shows..." |
| Essay planning | One paragraph plan | "Para 3: contrast approach A with B, cite Smith 2023" |
| Logic puzzles | One constraint applied | "If A=true, then B must be false (rule 3)" |
Step 2: Thought Generation
At each node, the LLM generates k candidate continuations. Two strategies:
Sample independently: Call the LLM k times with temperature > 0 on the same prompt. Each sample is an independent draw from the distribution. Better for creative or open-ended reasoning where diversity matters.
Propose as a set: Ask the LLM to generate k distinct continuations in a single call: "Propose 3 different next steps for this problem." More efficient (one API call vs k), but candidates may be less diverse because they share the same generation context.
Step 3: State Evaluation
The evaluator is the critical component. It takes an intermediate state and returns a score indicating how promising this partial solution looks. Two approaches:
Value-based scoring: The LLM rates the state on a numerical scale or with labels ("sure / likely / impossible"). For Game of 24: "Given remaining numbers [24, 2, 1], can you reach 24? Rate: sure." This works well when the LLM can assess partial progress reliably.
Vote-based scoring: Generate the full solution from this state multiple times and count how many succeed. More expensive but more reliable. The success rate across samples becomes the node's value.
The evaluator doesn't need to be perfect. It just needs to be right more often than random. Even a noisy evaluator that correctly prunes 60% of bad branches dramatically reduces wasted compute compared to exhaustive search.
External vs Self-Evaluation
The choice of evaluator is the most impactful design decision in a ToT system, and most teams get it wrong by defaulting to LLM self-evaluation.
Self-evaluation asks the same (or similar) LLM to score intermediate states. This is easy to implement but suffers from the same blind spots as the generator. If the LLM thinks "8 + 2 = 12" is correct during generation, it will probably think it's correct during evaluation too.
External verification uses a deterministic tool to check intermediate states. For math: a calculator. For code: compile and run tests. For logic puzzles: a constraint checker. External verifiers don't hallucinate and don't have model-specific biases.
| Evaluator Type | Reliability | Cost | Setup Complexity | Best For |
|---|---|---|---|---|
| LLM self-eval | Medium (60-75% accuracy) | Low (1 LLM call) | Trivial | Subjective tasks, prototyping |
| LLM vote-eval | High (80-90% accuracy) | High (5+ LLM calls) | Low | Any task, accuracy-critical |
| External verifier | Very high (95%+) | Very low (no LLM) | Medium-high | Math, code, logic, constraints |
| Hybrid (verifier + LLM) | Highest | Medium | High | Production systems |
The hybrid approach is what I recommend for production: use an external verifier where possible, fall back to LLM vote-based evaluation for steps that can't be programmatically checked.
Step 4: Search Algorithm
Two primary strategies navigate the tree:
Breadth-First Search (BFS): At each depth level, expand all nodes, evaluate them, keep the top-b (beam width), and discard the rest. BFS is the right choice when you need to compare options at the same depth and the branching factor is manageable. The Game of 24 paper used BFS with b=5.
Depth-First Search (DFS): Expand one branch fully to the end, backtrack if it fails, try the next branch. DFS uses less memory and finds solutions faster when the first deep path is likely correct. Better for code debugging where you want to fully test one hypothesis before moving to the next.
A hybrid approach is also common: use BFS for the first 1-2 levels to establish diverse starting points, then switch to DFS on the most promising branches to find solutions quickly. This captures the broad exploration benefits of BFS early while avoiding its memory overhead for deep search.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.