Inference-time scaling

TL;DR

Generate multiple candidate outputs from the same frozen model, then select the best one using a verifier, majority vote, or reward model. No weight updates needed.
Three core strategies: best-of-N sampling (generate N, pick the best), majority voting (consensus answer), and process reward models (score each reasoning step).
Inference-time scaling follows roughly log-linear improvement: doubling compute yields a fixed accuracy gain, with diminishing returns after 16-64 samples.
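OpenAI's o1/o3 and DeepSeek-R1 apply the same principle of spending more compute at inference, mainly through longer chains of thought rather than explicit sample-and-select. AlphaCode generates 1M+ candidate programs per problem, filters and clusters them, and submits only a handful, reaching performance comparable to a median human competitor in programming contests.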
The tradeoff is stark: 10-100x more cost per query, plus added latency unless the candidates are sampled in parallel. This only pays off for tasks with verifiable answers (math, code, logic) where quality is worth the compute.

The Problem It Solves

Your agent is solving a hard math problem. It generates one answer, and it's wrong. You could train a bigger model (6 months, $10M), fine-tune on math data (2 weeks, $50K), or try something simpler: generate 50 answers from the same model and pick the one that's correct. The third option costs $0.50 in API calls and takes 30 seconds.

This is the core insight of inference-time scaling. Traditional ML assumes that model capability is fixed after training. You improve performance by training longer, with more data, or with bigger models. But there's another axis: spend more compute at inference time. Generate multiple candidates, verify them, select the best.

I first saw this clearly when debugging a code-generation pipeline. A single sample from GPT-4 solved 67% of problems correctly. The same model with 100 samples and a code-execution verifier solved 85%. The model didn't change. The weights were identical. We just allocated more compute at inference time to search the model's output distribution for correct answers.

The math is compelling: for many verifiable tasks, doubling inference compute can yield more accuracy improvement per dollar than doubling training compute. This realization is reshaping how teams think about LLM deployment. Instead of "buy a bigger model," the answer is increasingly "sample more from the model you have."

What Is It?

Inference-time scaling allocates additional computation during inference to improve output quality. Instead of generating one response, the system generates many candidates and uses a selection mechanism (verifier, vote, or reward model) to pick the best one.

Think of it like a writer drafting an essay. A mediocre writer produces one draft and submits it. A skilled writer produces five drafts, reviews each one critically, takes the best paragraphs from each, and assembles a final version that's stronger than any individual draft. The writer's skill (model weights) didn't change between drafts. What changed was the amount of effort (compute) invested in producing the output.

The key components are the generator (produces candidates), the verifier (scores candidates), and the selector (picks the winner). The generator is your existing LLM. The verifier can be a different model, a code executor, a unit test suite, or a simple majority vote. The selector is usually argmax over verifier scores.
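
To make the three roles concrete, here is a minimal sketch of the generator, verifier, and selector wired together. The names `select_best`, `toy_generate`, and `toy_verify` are hypothetical, and the toy generator replays canned strings instead of calling a model, so treat this as an illustration of the shape rather than a production pipeline.

```python
from itertools import cycle
from typing import Callable, List, Tuple

Generator = Callable[[str], str]        # prompt -> one candidate
Verifier = Callable[[str, str], float]  # (prompt, candidate) -> score


def select_best(prompt: str, generate: Generator, verify: Verifier, n: int) -> Tuple[str, float]:
    """Generate n candidates, score each one, and return the argmax with its score."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])  # selector: argmax over scores
    return candidates[best_index], scores[best_index]


# Hypothetical stand-ins so the sketch runs without a model or API key.
_canned = cycle(["49", "52", "51", "50"])


def toy_generate(prompt: str) -> str:
    # A real generator would sample the LLM with temperature > 0.
    return next(_canned)


def toy_verify(prompt: str, candidate: str) -> float:
    # Exact check that only makes sense for this one prompt.
    return 1.0 if int(candidate) == 17 * 3 else 0.0


answer, score = select_best("What is 17 * 3?", toy_generate, toy_verify, n=4)
print(answer, score)  # prints: 51 1.0
```

Keeping the generator and verifier as plain callables means the verifier can be swapped (test suite, reward model, vote) without touching the sampling code.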

Where Inference-Time Scaling Sits

Strategy	What Changes	Cost Multiplier	Accuracy Gain
Better prompt engineering	Input only	1x	+5-15%
Fine-tuning	Model weights	1x per query (upfront cost)	+10-25%
Bigger model	Architecture + weights	2-10x per query	+5-20%
Inference-time scaling	Compute at inference	10-100x per query	+10-40%
Training-time scaling	More training data/compute	1x per query (massive upfront)	+5-15%

Inference-time scaling occupies a distinctive position: like prompt engineering, it requires no changes to the model, but it buys accuracy with per-query compute instead of a better input. This makes it immediately deployable on any existing model you can sample from more than once.

How It Works

Strategy 1: Best-of-N Sampling

The simplest inference-time scaling strategy. Generate N independent completions from the same prompt (with temperature > 0), score each completion with a verifier, and return the highest-scoring one.

The scoring function is everything. For math: plug the answer back into the equation. For code: run the test suite. For general text: use a reward model trained on human preferences.
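
The first two scoring functions can be sketched in a few lines. This is an assumption-laden illustration: `score_math`, `score_code`, and `best_of_n` are hypothetical helpers, the math check assumes the problem can be written as a residual that should equal zero, and the code check uses a bare `exec`, which is unsafe for untrusted model output and would need a sandbox in practice.

```python
from typing import Callable, List, Sequence, Tuple


def score_math(candidate: float, residual: Callable[[float], float], tol: float = 1e-9) -> float:
    """Plug the candidate back in: 1.0 if the equation's residual is ~0, else 0.0."""
    return 1.0 if abs(residual(candidate)) <= tol else 0.0


def score_code(source: str, func_name: str, tests: List[Tuple[tuple, object]]) -> float:
    """Return the fraction of test cases that the candidate program passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: sandbox this for real model output
        func = namespace[func_name]
    except Exception:
        return 0.0  # does not compile or does not define the function
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests)


def best_of_n(candidates: Sequence, score: Callable) -> object:
    """Return the highest-scoring candidate (first one wins ties)."""
    return max(candidates, key=score)


# Math: candidate solutions to 2x + 6 = 0.
best_root = best_of_n([3.0, -2.0, -3.0, 0.0], lambda c: score_math(c, lambda x: 2 * x + 6))

# Code: candidate implementations of add(a, b), one of them with a syntax error.
code_candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
    "def add(a, b) return a + b",
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
best_program = best_of_n(code_candidates, lambda s: score_code(s, "add", tests))
print(best_root)
print(best_program)
```

Note that the buggy `a - b` candidate still passes one of the three tests, which is why a thin test suite makes a weak verifier.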

Coverage probability is the key metric. If the model generates the correct answer with probability p on a single try and the samples are independent, the probability of at least one correct answer in N tries is: $1 - (1-p)^N$. For p = 0.3 and N = 16, coverage reaches 99.7%. Coverage is an upper bound on accuracy: a model that's only 30% accurate per sample becomes 99%+ accurate only if the verifier reliably picks the correct candidate out of the N.

The scaling law is logarithmic. Going from N=1 to N=8 gives a large accuracy jump. Going from N=64 to N=128 gives a small one. For most tasks, N=16-32 captures 80%+ of the available improvement.

N (samples)	Coverage (p=0.3)	Coverage (p=0.1)	Coverage (p=0.5)
1	30%	10%	50%
4	76%	34%	94%
8	94%	57%	99.6%
16	99.7%	81%	99.998%
32	99.999%	97%	~100%
64	~100%	99.9%	~100%
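
The coverage formula is easy to evaluate directly. The sketch below assumes independent samples and a verifier that never misses a correct candidate; `coverage` and `samples_needed` are illustrative helper names, not part of any library.

```python
import math


def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float) -> int:
    """Smallest n whose coverage reaches the target (0 < p < 1, 0 < target < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# One row per sample count, one column per single-sample success rate.
for n in (1, 4, 8, 16, 32, 64):
    row = "\t".join(f"{coverage(p, n):.3%}" for p in (0.3, 0.1, 0.5))
    print(f"{n}\t{row}")

print(samples_needed(0.3, 0.99))  # samples required for 99% coverage at p = 0.3
```

Inverting the formula is often the more useful direction: at p = 0.3, thirteen samples are already enough to pass 99% coverage, so a budget of 64 mostly buys insurance against a lower p than you estimated.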

Strategy 2: Majority Voting (Self-Consistency)

Generate N completions, extract the final answer from each, and return the answer that appears most frequently. No verifier model needed. Wang et al. (2022) introduced this as "self-consistency" for chain-of-thought prompting.

The insight: different reasoning paths can arrive at the same correct answer. If 7 out of 10 samples produce "x = 42" through different reasoning chains, that answer is likely correct, even if the individual reasoning chains contain errors.

Majority voting works best when wrong answers are randomly distributed (each wrong answer is different) but correct answers converge. It fails when the model has a systematic bias, where a common wrong answer beats the correct one.
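
A minimal sketch of self-consistency voting, including the failure case. It assumes each completion ends with a line of the form `Answer: <value>`; that format, and the helper names `extract_answer` and `majority_vote`, are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer from a completion that ends with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    return match.group(1) if match else None


def majority_vote(completions: List[str]) -> Tuple[Optional[str], float]:
    """Return the most frequent answer and its share of the parsable samples."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# Different reasoning paths converge on the same answer; wrong ones scatter.
converging = [
    "Double 21 to get 42. Answer: 42",
    "6 * 7 = 42. Answer: 42",
    "40 + 2 = 42. Answer: 42",
    "21 + 20 = 41. Answer: 41",
    "I am not sure.",  # no parsable answer, so it is ignored
]
answer, share = majority_vote(converging)
print(answer, share)  # prints: 42 0.75

# Systematic bias: the same wrong answer shows up more often than the right one.
biased = ["Answer: 41", "Answer: 41", "Answer: 41", "Answer: 42", "Answer: 42"]
biased_answer, biased_share = majority_vote(biased)
print(biased_answer, biased_share)  # prints: 41 0.6
```

The vote share doubles as a cheap confidence signal: a winner with a bare plurality is a good candidate for escalation to a real verifier.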

Weighted voting improves on naive majority voting. Assign each sample a confidence score (from the model's token probabilities or a lightweight scorer) and use weighted majority. This gives more influence to high-confidence samples.
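
One way to sketch weighted voting is to weight each sample by the exponential of its mean token log-probability. That particular confidence measure is an assumption chosen for illustration (a lightweight scorer would slot in the same way), and `weighted_vote` is a hypothetical helper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(samples: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """samples holds (answer, mean_token_logprob) pairs; returns the winner and the totals."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, mean_logprob in samples:
        totals[answer] += math.exp(mean_logprob)  # confidence weight in (0, 1]
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Three low-confidence votes for "41" against two high-confidence votes for "42".
samples = [("41", -1.2), ("41", -1.4), ("41", -1.6), ("42", -0.1), ("42", -0.2)]
winner, totals = weighted_vote(samples)
print(winner)  # prints: 42
```

Here a naive majority would return "41" (three votes to two), while the weighted total favors "42". Token probabilities are not always well calibrated, so this helps only to the extent that confidence tracks correctness.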

Strategy 3: Process Reward Models (PRMs)

Outcome Reward Models (ORMs) score complete solutions. Process Reward Models (PRMs) score each intermediate reasoning step. PRMs give a finer-grained signal because they can catch errors at the step where they occur, not just at the final answer, though they need step-level training labels, which are more expensive to collect.

The distinction matters for search efficiency. With an ORM, you must generate full solutions and then score. With a PRM, you can prune bad reasoning paths early, allocating compute to the most promising partial solutions.
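
Early pruning can be sketched as a step-level beam search. This is one possible way to use a PRM, not the only one: the callables `propose_steps`, `score_step`, and `is_complete` are hypothetical stand-ins for the LLM, the PRM, and a stop check, and scoring a path by its weakest step is an assumption (products or last-step scores are also used).

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]  # a partial solution: the reasoning steps taken so far


def prm_beam_search(
    problem: str,
    propose_steps: Callable[[str, Path], List[str]],
    score_step: Callable[[str, Path], float],
    is_complete: Callable[[Path], bool],
    beam_width: int = 1,
    max_depth: int = 4,
) -> Path:
    """Expand partial solutions step by step, keeping only the best-scored paths."""
    beam: List[Tuple[float, Path]] = [(1.0, ())]
    for _ in range(max_depth):
        expanded: List[Tuple[float, Path]] = []
        for path_score, path in beam:
            if path and is_complete(path):
                expanded.append((path_score, path))  # finished paths carry over
                continue
            for step in propose_steps(problem, path):
                new_path = path + (step,)
                # Assumption: a path is only as good as its weakest step.
                step_score = score_step(problem, new_path)
                expanded.append((min(path_score, step_score), new_path))
        if not expanded:
            break
        # Prune: low-scoring partial paths are never expanded further.
        beam = sorted(expanded, key=lambda item: item[0], reverse=True)[:beam_width]
        if all(is_complete(path) for _, path in beam):
            break
    return max(beam, key=lambda item: item[0])[1]


# Toy stand-ins for "Solve 2x + 6 = 0" so the sketch runs without any model.
GOOD_STEPS = {"2x = -6", "x = -3"}


def toy_propose(problem: str, path: Path) -> List[str]:
    return ["2x = 6", "2x = -6"] if not path else ["x = 3", "x = -3"]


def toy_prm(problem: str, path: Path) -> float:
    return 0.9 if path[-1] in GOOD_STEPS else 0.1


def toy_done(path: Path) -> bool:
    return path[-1].startswith("x =")


best_path = prm_beam_search("Solve 2x + 6 = 0", toy_propose, toy_prm, toy_done)
print(best_path)  # prints: ('2x = -6', 'x = -3')
```

With `beam_width=1` the sign error `2x = 6` is dropped after the first step, so no compute is spent completing it. An ORM-only setup would have to finish all four full solutions before scoring any of them.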

Figure: Problem Input -> Generate N Samples -> PRM Scoring -> Prune Low Scores -> Expand Best Paths -> Select Best. Inference-time scaling with a Process Reward Model: generate candidates, score each step, prune, and select the best path.
