Recursive best-of-N
Generate N candidate solutions, score them with a verifier, then recursively refine the best candidate until quality converges or budget is exhausted.
TL;DR
- Generate N candidate solutions in parallel, score each with a verifier, pick the best, and optionally feed it back as a seed for the next round of generation.
- Quality improves roughly as
log(N): going from N=1 to N=4 is transformative, N=4 to N=16 helps less, and N=16 to N=64 barely moves the needle. - The verifier is the entire bottleneck. A bad verifier makes best-of-N into expensive random sampling. Test execution, formal checkers, and strong judge LLMs are the three reliable verifier types.
- Recursive refinement (2-3 rounds) compounds gains: each round starts from the best candidate of the previous round rather than from scratch.
- Cost is linear in N per round and multiplicative across rounds. N=8 with 3 rounds = 24 LLM calls total, a modest budget for a meaningful quality boost.
The Problem It Solves
Your LLM generates a Python function to parse CSV files with nested quotes. The output looks reasonable but fails on edge cases: escaped delimiters inside quoted fields, trailing newlines, empty columns. You re-run the same prompt and get a slightly different version that handles escaped delimiters but breaks on empty columns. A third attempt handles empty columns but introduces a regression on quoted fields.
Each individual sample is imperfect, but they're imperfect in different ways. The knowledge needed to solve the problem completely exists across the samples. The bottleneck isn't the model's capability; it's that any single sample captures only a portion of the model's distribution of correct behaviors.
I've seen this pattern repeatedly in production: the model "knows" the right answer, but a single sample is unlikely to surface it. Temperature 0 gives one deterministic (and often flawed) output. Temperature 0.7 gives variety but no mechanism to select the best variant. What's missing is systematic sampling followed by evaluation and selection.
Best-of-N solves this by generating N candidates, scoring each with a verifier, and keeping only the winner. Recursive best-of-N takes it further: the winner becomes a seed for the next round of generation, letting the model refine an already-good solution rather than starting from scratch. This is pure inference-time scaling, trading compute for quality without any model fine-tuning.
What Is It?
Recursive best-of-N generates N candidate solutions in parallel, evaluates each with a verifier, selects the highest-scoring candidate, and (optionally) feeds that winner back as context for the next round of generation, repeating until quality converges or the compute budget is exhausted.
Think of it like a writing competition with multiple rounds. In round one, 8 contestants write essays independently. A panel of judges scores them, and the best essay advances. In round two, 8 new contestants read the winning essay as inspiration, then write their own improved versions. The judges score again, and the best advances. After 3 rounds, the final essay is dramatically better than any single first-draft attempt.
The key insight: each round doesn't start from zero. The best candidate from round R becomes few-shot context (or a refinement prompt) for round R+1. This means the model builds on proven good work rather than independently guessing from scratch.
Two Levels of Best-of-N
| Level | Mechanism | Cost | Quality Gain |
|---|---|---|---|
| Level 0: Basic | Sample N, pick best | N calls | log(N) improvement |
| Level 1: Recursive | Sample N, pick best, use as seed for next round, repeat R times | N x R calls | Compounds per round |
Most teams start with Level 0. Level 1 is worth adding when Level 0's returns have diminished and you still need higher quality.
How It Works
Step 1: Parallel Generation
The generator produces N independent solutions from the same prompt. Independence is critical: each candidate should be a fresh attempt, not a minor variation of the previous one. Use temperature 0.7-1.0 to ensure diversity across samples.
The prompt structure matters. For Level 0, every candidate gets the same prompt. For Level 1 (recursive rounds after the first), candidates receive the winning solution from the previous round as few-shot context: "Here's a good solution. Generate an improved version that fixes any remaining issues."
Parallelism is where best-of-N shines operationally. All N candidates can be generated simultaneously with concurrent API calls. Unlike tree search (LATS, ToT), there's no sequential dependency between candidates within a round. This means wall-clock latency is the same as generating a single response, assuming you have API concurrency.
Step 2: Verification
Each candidate is scored by a verifier. The verifier is the most important design decision in the entire pattern. Three verifier types, in order of reliability:
Deterministic verifiers (test execution, formal checkers, SQL comparison). Run the generated code against a test suite. Compare the SQL output to expected results. Verify the math proof with a symbolic checker. These provide ground-truth signal with zero noise. If you have a deterministic verifier, best-of-N is nearly guaranteed to improve with larger N.
LLM-as-judge verifiers. A separate LLM (or the same model with a judge prompt) scores each candidate on criteria like correctness, completeness, code quality, and adherence to requirements. This is noisier than deterministic verification, but works for tasks without formal test suites (creative writing, design documents, explanations). Use a stronger model as the judge: if your generator is GPT-4o-mini, use GPT-4 as the judge.
Heuristic verifiers. Rule-based scoring: does the output contain required keywords? Is it within the length limit? Does it follow the specified format? Fast and cheap, but shallow. Best used as a pre-filter before a more expensive LLM-as-judge pass.
I've seen teams skip the verifier design step and use the simplest scoring method available. This is the single biggest mistake with best-of-N. A random verifier makes N-way sampling into expensive randomness. Invest 80% of your design time on the verifier.
Step 3: Selection and Seed Refinement
The highest-scoring candidate is selected. In Level 0 (basic best-of-N), this is the final output. In Level 1 (recursive), the winner becomes the seed for the next round.
Seed refinement changes the prompt from "Solve this problem" to "Here is a strong solution. Improve it by fixing any remaining issues, edge cases, or inefficiencies." This narrows the search space from "all possible solutions" to "variations near the best-known solution," which consistently produces higher-quality refinements.
The typical recursive depth is 2-3 rounds. Beyond that, gains flatten because the model has already converged to a near-optimal solution within its capability range. I've run experiments with 5 rounds and seen essentially zero improvement from round 3 onward.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.