Multi-agent debate
Run multiple LLM instances as debaters that critique each other's answers, converging on more accurate outputs than any single model produces alone.
TL;DR
- Multiple LLM instances independently answer a question, then see each other's responses, critique them, and update their positions across debate rounds.
- Cross-critique surfaces errors that self-reflection misses: Du et al. (2023) showed 5-20% accuracy improvements on factual QA benchmarks including TruthfulQA.
- Independence before debate is non-negotiable. If debaters share context before generating initial answers, they converge to the same mistakes.
- Two termination strategies: fixed rounds (2-3 rounds, predictable cost) or convergence detection (stop when debaters agree, variable cost).
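- The number of LLM calls scales linearly with debater count and round count. Three debaters over 3 rounds means 9 calls, so at least 9x the tokens of a single call, since each debate-round prompt also carries the other debaters' answers. Reserve for high-stakes decisions where correctness justifies the cost.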
The Problem It Solves
Your agent generates a summary of a financial report. The summary states that revenue grew 12% year-over-year. It sounds confident. It's also wrong: revenue grew 12% quarter-over-quarter, and the year-over-year number is 4%. A single LLM has no adversarial pressure to double-check its own claims. It commits to the first plausible-sounding interpretation and moves on.
Self-reflection helps a little, but the problem is deeper. When you ask the same model to "review your answer for errors," it tends to confirm its own reasoning. The model generated that 12% claim, and when asked to evaluate it, the same weights, the same biases, and the same attention patterns that produced the error are the ones judging it. This is confirmation bias, automated.
I've seen this pattern cause real damage in production: legal summaries with inverted party names, medical research summaries with wrong dosage numbers, code reviews that miss the same class of bug the author introduced. The common thread is that a single perspective, no matter how capable, has systematic blind spots.
The insight behind multi-agent debate, formalized by Du et al. (2023) in "Improving Factuality and Reasoning in Language Models through Multiagent Debate," is that multiple independent perspectives with adversarial cross-examination produce more accurate outputs than any single perspective checking itself. The same principle underpins peer review in science, adversarial proceedings in law, and red-teaming in security.
What Is It?
Multi-agent debate runs two or more LLM instances as independent debaters. Each generates an initial answer in isolation, then all debaters see each other's answers and are asked to critique, challenge, and refine their positions. This critique-and-update cycle repeats for a fixed number of rounds or until the debaters converge on a shared answer.
Think of it like a courtroom trial. A single investigator might miss exculpatory evidence because they've already formed a theory of the case. But a prosecution and defense, each motivated to find holes in the other's argument, together produce a more complete picture than either would alone. The judge (or a synthesis agent) then weighs both arguments to reach a verdict.
The key mechanism is that each debater has its own isolated context window. They don't share hidden state, chain-of-thought traces, or intermediate reasoning. This isolation makes their errors less correlated, so when debater A makes an error that debater B doesn't, the cross-critique has a chance to catch it.
Why Independence Matters (Mathematically)
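The accuracy improvement from debate depends on how independent the debaters' errors are. If each debater gets a question wrong with probability p, and debaters are fully independent, the probability that all 3 debaters get it wrong is p^3. For p=0.3 (70% individual accuracy), that is 0.027, so at least one debater holds the correct answer 97.3% of the time. A plain majority vote does not capture all of that: it fails whenever at least 2 of the 3 are wrong, which happens with probability 3p^2(1-p) + p^3 = 0.216 (78.4% accuracy). The gap between 78.4% and 97.3% is what the critique rounds are for: they give a lone correct debater the chance to persuade the others.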
But this only works if errors are truly independent. If debaters share training data biases, their error correlation is high and the benefit shrinks. This is why model diversity (different vendors, different architectures) produces better results than running the same model three times with different system prompts. Same-model debate still helps (error correlation is typically 0.3-0.5, not 1.0), but cross-vendor debate pushes correlation closer to 0.1-0.2.
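The arithmetic for independent debaters is easy to check directly. The sketch below is a simplified model, not a description of any real system: it assumes every debater is wrong with the same probability p, that errors are fully independent, and that an answer is simply right or wrong. The function names are made up for illustration.

```python
from math import comb

def p_all_wrong(p: float, n: int) -> float:
    """Probability that every one of n independent debaters is wrong."""
    return p ** n

def p_majority_wrong(p: float, n: int) -> float:
    """Probability that more than half of n independent debaters are wrong."""
    k_min = n // 2 + 1  # smallest number of wrong debaters that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )

p = 0.3  # 70% individual accuracy, as in the example above
print(round(p_all_wrong(p, 3), 3))       # 0.027
print(round(p_majority_wrong(p, 3), 3))  # 0.216
```

With correlated errors both numbers get worse, which is the practical argument for model diversity.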
How It Works
The Debate Loop
The pattern follows a simple loop: generate independently, share answers, critique, update, check for convergence.
Round 0 (Independent Generation): Each debater receives the same prompt and generates an answer in isolation. No debater sees any other's output. This is the critical step that ensures error independence. Using different system prompts (e.g., "You are a cautious analyst" vs "You are an optimistic analyst") can further increase diversity.
Round N (Debate): Each debater receives its own previous answer plus all other debaters' previous answers. The prompt asks it to: (1) identify flaws in opponents' reasoning, (2) defend or update its own position with evidence, and (3) produce a revised answer. The debate prompt should explicitly encourage changing position when confronted with better evidence.
Termination: Either stop after a fixed number of rounds (2-3 is typical) or detect convergence (all debaters produce the same answer). Fixed rounds are simpler and more predictable for cost budgeting. Convergence detection is more token-efficient when debaters agree quickly.
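The three phases above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Debater` is a hypothetical interface (any function from prompt to answer, which in practice would wrap an LLM API call), the prompt wording is invented, and the convergence check is a naive exact match that real free-text answers would need something smarter for.

```python
from typing import Callable, List

# Hypothetical interface: a debater maps a prompt string to an answer string.
Debater = Callable[[str], str]

def build_debate_prompt(question: str, own: str, others: List[str]) -> str:
    """Prompt for a debate round: own previous answer plus every opponent's."""
    opponents = "\n".join(f"- Opponent {i + 1}: {a}" for i, a in enumerate(others))
    return (
        f"Question: {question}\n"
        f"Your previous answer: {own}\n"
        f"Other debaters' answers:\n{opponents}\n"
        "1. Identify flaws in the opponents' reasoning.\n"
        "2. Defend or update your position with evidence.\n"
        "3. Give a revised answer. Change position if the evidence is better."
    )

def converged(answers: List[str]) -> bool:
    """Naive check: all answers identical after trimming and lowercasing."""
    return len({a.strip().lower() for a in answers}) == 1

def run_debate(question: str, debaters: List[Debater], max_rounds: int = 2) -> List[str]:
    """Run round 0 plus up to max_rounds debate rounds; return final answers."""
    # Round 0: each debater sees only the question, never another's output.
    answers = [d(question) for d in debaters]
    for _ in range(max_rounds):
        if converged(answers):
            break  # convergence detection: stop early when everyone agrees
        # Build all prompts from the previous round before calling anyone,
        # so no debater sees an answer produced in the current round.
        prompts = [
            build_debate_prompt(question, answers[i], answers[:i] + answers[i + 1:])
            for i in range(len(debaters))
        ]
        answers = [d(p) for d, p in zip(debaters, prompts)]
    return answers
```

Setting `max_rounds` gives the fixed-rounds budget, and the early `break` adds convergence detection on top, so the two termination strategies combine naturally.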
Debate Variants
Round-Robin (default). All debaters see all other answers each round. Simple, effective, scales to 2-5 debaters. Beyond 5, context windows fill with too many competing positions and the quality of critique degrades.
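A rough cost model shows why round-robin prompts fill up as debaters are added. The sketch below is a back-of-the-envelope estimate under stated assumptions: every answer has the same length, each debate-round prompt contains the question plus all debaters' previous answers, and instruction overhead is ignored. The function name and token counts are illustrative, not measurements.

```python
from typing import Dict

def estimate_round_robin_tokens(
    n_debaters: int,
    debate_rounds: int,    # rounds after the independent round 0
    question_tokens: int,  # assumed size of the shared question
    answer_tokens: int,    # assumed size of each answer
) -> Dict[str, int]:
    """Estimate calls and tokens for a round-robin debate."""
    # Round 0: every debater reads only the question.
    round0_input = n_debaters * question_tokens
    # Debate round: every debater reads the question plus all n previous answers.
    per_round_input = n_debaters * (question_tokens + n_debaters * answer_tokens)
    calls = n_debaters * (1 + debate_rounds)
    return {
        "calls": calls,
        "input_tokens": round0_input + debate_rounds * per_round_input,
        "output_tokens": calls * answer_tokens,
    }

small = estimate_round_robin_tokens(3, 2, question_tokens=200, answer_tokens=300)
large = estimate_round_robin_tokens(5, 2, question_tokens=200, answer_tokens=300)
print(small)
print(large)
```

In this model the call count grows linearly with debater count, but input tokens per debate round grow with its square, because each added debater both reads more answers and contributes one more answer for everyone else to read.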
Structured Roles. Assign specific roles: proposer, critic, and judge. The proposer generates, the critic attacks, and the judge resolves. This variant is faster (fewer total LLM calls) but less robust because the proposer and critic may develop complementary blind spots. I've used this variant for code review where the roles map cleanly to "author" and "reviewer."
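The proposer, critic, and judge flow is a straight pipeline of three calls. The sketch below assumes a hypothetical `Agent` type (prompt in, text out) standing in for whatever LLM client is used; the prompt wording is invented for illustration.

```python
from typing import Callable

# Hypothetical interface: an agent maps a prompt string to a text response.
Agent = Callable[[str], str]

def structured_debate(task: str, proposer: Agent, critic: Agent, judge: Agent) -> str:
    """Run one proposer -> critic -> judge pass and return the judge's answer."""
    proposal = proposer(f"Task: {task}\nPropose a solution.")
    critique = critic(
        f"Task: {task}\nProposal: {proposal}\n"
        "List concrete flaws in this proposal."
    )
    # The judge sees both sides and resolves them into a final answer.
    return judge(
        f"Task: {task}\nProposal: {proposal}\nCritique: {critique}\n"
        "Weigh both and give the final answer."
    )
```

One pass costs exactly three calls regardless of task, which is where the speed advantage over round-robin comes from.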
Tournament Bracket. For complex tasks, run multiple two-debater matches, then have winners debate each other. Useful when you have 8+ candidate solutions and need to narrow down efficiently. Each round halves the field.
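A single-elimination bracket can be sketched as repeated pairing. Here `match` is a hypothetical function that runs a two-debater debate between two candidate solutions and returns the winner; how that debate is judged is left open. The bye rule for odd field sizes is an assumption of this sketch.

```python
from typing import Callable, List

# Hypothetical: runs a two-debater match and returns the winning candidate.
Match = Callable[[str, str], str]

def run_bracket(candidates: List[str], match: Match) -> str:
    """Narrow candidates to one winner via single-elimination matches."""
    if not candidates:
        raise ValueError("need at least one candidate")
    field = list(candidates)
    while len(field) > 1:
        # Pair neighbours: (0, 1), (2, 3), ...
        next_round = [
            match(field[i], field[i + 1])
            for i in range(0, len(field) - 1, 2)
        ]
        if len(field) % 2 == 1:
            next_round.append(field[-1])  # odd one out advances with a bye
        field = next_round
    return field[0]
```

For n candidates this runs n - 1 matches in total, so 8 candidates need 7 two-debater debates across 3 rounds.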
Diverse Model Ensembles. Use different models as debaters (Claude as debater A, GPT-4 as debater B, Gemini as debater C). Different models have different biases and knowledge gaps, so their error profiles are more independent. This is the strongest variant for factual accuracy but adds operational complexity of managing multiple model integrations.
The Judge Mechanism
After debate rounds, you need a final answer. Three options:
Majority vote. The simplest approach. If 2 of 3 debaters converge on the same answer, that's the output. Fast, no extra LLM calls. Fails when the minority debater is actually right.
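A minimal majority-vote sketch, assuming answers are short strings that can be compared after trimming and lowercasing (free-text answers would need an extraction or normalization step first). The function name is illustrative.

```python
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the normalized answer held by a strict majority, else None."""
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    # Require more than half the debaters, not just a plurality.
    return winner if count > len(answers) / 2 else None
```

Returning `None` when no strict majority exists makes the failure case explicit, so the caller has to decide on a fallback instead of silently accepting a plurality.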