Stochastic multi-agent consensus
Learn how spawning parallel agents with varied prompts exploits LLM stochasticity to traverse broader solution spaces, surfacing rare ideas that single queries miss.
TL;DR
- Spawn N sub-agents (typically 5-10) with slight prompt variations on the same question, then aggregate their outputs by frequency to produce a consensus report with three tiers: consensus, divergent, and outlier items.
- Exploits a fundamental LLM property: temperature sampling means each run explores a different region of the probability distribution. Under a simple independence model, if a single query covers roughly 10% of the solution space, 10 varied queries can cover about 65%.
- Outlier ideas (mentioned by only 1-2 agents) are the real payoff. They're either brilliant insights that only surface in rare sampling paths, or hallucinations. Both are worth investigating.
- Running 10 agents in parallel takes roughly the same wall-clock time as running 1 agent, because the run ends when the slowest agent finishes. The added latency is near zero; the token cost is 10x.
- Total cost for a 10-agent consensus run: roughly $0.50-$1.00 at Sonnet 4.6 pricing. Reserve for strategic decisions, not routine tasks.
The Problem It Solves
You ask Claude to suggest growth strategies for your SaaS product. The response is competent, relevant, and completely predictable: SEO, content marketing, referral programs, product-led growth. You've already tried all of these. You needed the non-obvious idea, the one that would actually change your trajectory. But a single query only samples one path through the model's probability distribution.
Run the same prompt again with temperature 0.7 and you get a slightly different list, but still within the same cluster of "obvious" ideas. The model's most probable outputs dominate every sample. The brilliant-but-low-probability ideas (the ones in the 5th or 10th percentile of the output distribution) almost never surface because they lose the sampling race to higher-probability tokens at every generation step.
I've watched teams burn hours re-prompting the same model, tweaking phrasing, hoping to stumble onto something novel. The problem isn't the model's capability. The knowledge is in the weights. The problem is that single-sample inference is a lossy bottleneck that systematically filters out rare ideas.
What Is It?
Stochastic multi-agent consensus spawns multiple sub-agents with slight prompt variations on the same question, runs them in parallel with zero shared state, then feeds all responses to an orchestrator that aggregates results by frequency into consensus (high-agreement), divergent (partial agreement), and outlier (rare) tiers.
Think of it like polling a panel of experts who've each been briefed slightly differently. One expert was told to think conservatively. Another was told to focus on what's measurable. A third was told to challenge every assumption. They all answer the same core question, but their different analytical lenses cause them to surface different parts of the solution space. When you collect all their answers, the ideas that multiple experts independently suggested are high-confidence. The ideas only one expert mentioned are either that expert's unique insight, or a blind alley. Both are worth examining.
The mathematical reason this works comes from probability theory. If a single query samples a region of size $r$ from a total solution space $S$, and each query's region is roughly independent of the others, then $N$ queries with varied framings together cover approximately $1 - (1 - r/S)^N$ of the space. For $r/S = 0.1$ (each query covers 10%), $N = 10$ queries cover $1 - 0.9^{10} \approx 0.65$, or 65% of the space. The coverage follows a diminishing-returns curve: going from 1 to 5 agents is transformative, 5 to 10 is solid, and 10 to 20 adds the least per extra agent.
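To make the diminishing-returns curve concrete, here is a minimal sketch that evaluates the coverage formula for a few values of $N$. The function name `coverage` is illustrative, and the calculation assumes each query covers an independent region of the space. Real prompt variations are correlated, so treat the numbers as an optimistic estimate.

```python
def coverage(r_over_s: float, n: int) -> float:
    """Expected fraction of the solution space covered by n independent queries.

    r_over_s is the fraction a single query covers (r/S in the text).
    """
    return 1.0 - (1.0 - r_over_s) ** n


for n in (1, 5, 10, 20):
    print(f"N={n:>2}: {coverage(0.1, n):.0%}")
# N= 1: 10%
# N= 5: 41%
# N=10: 65%
# N=20: 88%
```

With $r/S = 0.1$, the first 5 agents add about 41 points of coverage in total, while agents 11 through 20 together add about 23.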
How It Works
Step 1: Prompt variation design
The orchestrator takes a single user question and generates N prompt variations. These are not different questions. They're the same question viewed through different analytical lenses. Each lens biases the model toward a different region of the output distribution.
Five proven variation strategies:
| Strategy | Example Framing | What It Surfaces |
|---|---|---|
| Analytical lens | "Analyze conservatively" vs "Assume unlimited resources" | Risk-averse vs ambitious ideas |
| Role-based | "As a growth marketer" vs "As a CFO" vs "As a data scientist" | Domain-specific insights |
| Constraint-based | "Given tight budget" vs "Given tight timeline" vs "Given limited team" | Constraint-optimized solutions |
| Perspective shift | "From the user's view" vs "From a competitor's view" vs "From an investor's view" | Stakeholder-specific blind spots |
| Contrarian | "Challenge every assumption" vs "What would fail first?" | Non-obvious risks and alternatives |
I've found that mixing 2-3 strategies across a 10-agent run produces the best coverage. Pure role-based variation alone tends to produce overlapping outputs because roles share assumptions. Combining role-based with contrarian pushes agents further apart in the solution space.
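One way to mix strategies across a run is to assign framings round-robin, as in the sketch below. The framings are copied from the table above; the `FRAMINGS` dictionary, the `build_variations` function, and the prompt layout are hypothetical choices for illustration, not a prescribed format.

```python
from itertools import cycle

# Framings taken from the strategy table; the grouping keys are illustrative.
FRAMINGS = {
    "role": ["As a growth marketer", "As a CFO", "As a data scientist"],
    "contrarian": ["Challenge every assumption", "What would fail first?"],
    "constraint": ["Given tight budget", "Given tight timeline", "Given limited team"],
}


def build_variations(question: str, n: int, strategies: list[str]) -> list[str]:
    """Return n prompts for the same question, rotating across strategies."""
    pools = [cycle(FRAMINGS[name]) for name in strategies]
    prompts = []
    for i in range(n):
        # Agent i draws the next framing from strategy i mod len(strategies).
        framing = next(pools[i % len(pools)])
        prompts.append(f"Framing: {framing}\nQuestion: {question}")
    return prompts


prompts = build_variations(
    "Suggest growth strategies for our SaaS product.",
    n=10,
    strategies=["role", "contrarian", "constraint"],
)
```

With 10 agents and only 8 framings in these three pools, two framings repeat. A repeated prompt still yields a fresh sample, but adding a fourth strategy would keep every agent's lens distinct.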
Step 2: Parallel sub-agent execution
Each sub-agent receives its unique prompt variation and runs in a completely isolated context window. Zero shared state between agents is critical. If agents share any intermediate reasoning, their outputs become correlated and the diversity benefit collapses.
The wall-clock time for 10 parallel agents is roughly that of a single agent: the run ends when the slowest agent finishes, typically 5-15 seconds. Compare this to running 10 sequential queries, which would take 50-150 seconds. Parallelization gives you 10x information density at 1x latency.
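The latency argument can be sketched with `asyncio`. Here `run_agent` is a hypothetical stand-in that sleeps instead of calling a model, so the example runs without any API; in a real pipeline it would wrap whatever client you use, with each call receiving only its own prompt.

```python
import asyncio
import time


async def run_agent(prompt: str, delay: float) -> str:
    """Hypothetical stand-in for one isolated model call."""
    await asyncio.sleep(delay)  # simulates network and generation latency
    return f"response to: {prompt}"


async def run_all(prompts: list[str], delays: list[float]) -> list[str]:
    # Each coroutine sees only its own prompt: no shared history or state.
    tasks = (run_agent(p, d) for p, d in zip(prompts, delays))
    return await asyncio.gather(*tasks)


def timed_run(prompts: list[str], delays: list[float]) -> tuple[list[str], float]:
    """Run all agents concurrently and return (responses, elapsed seconds)."""
    start = time.perf_counter()
    responses = asyncio.run(run_all(prompts, delays))
    return responses, time.perf_counter() - start
```

Because `asyncio.gather` waits for all tasks concurrently, the elapsed time tracks the largest delay rather than the sum of the delays, and the responses come back in the same order as the prompts.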
Step 3: Orchestrator synthesis
The orchestrator (use the strongest available model: Opus 4.6 or GPT 5.4) receives all N responses and performs three operations:
- Idea extraction: pull discrete ideas out of each response, converting prose into a structured list of actionable items.
- Frequency analysis: count how many agents independently suggested each idea.
- Tier classification: bucket each idea into one of three tiers based on frequency.
The three-tier output:
| Tier | Frequency | Confidence | Action |
|---|---|---|---|
| Consensus | 7+ of 10 agents | High. Cross-validated by independent runs. | Likely worth doing. |
| Divergent | 3-6 of 10 agents | Medium. Some framings surface it, others don't. | Needs human judgment. |
| Outlier | 1-2 of 10 agents | Low. Either brilliant or hallucinated. | Investigate before acting. |
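The frequency analysis and tier classification can be sketched as below. This assumes the extraction step has already mapped each agent's ideas onto shared canonical labels, which is the hard part and is not shown. The `classify` function is illustrative, and scaling the table's thresholds (7+ and 3-6 of 10) proportionally to other values of N is an assumption.

```python
from collections import Counter


def classify(agent_ideas: list[set[str]]) -> dict[str, list[str]]:
    """Bucket ideas into tiers by how many agents mentioned each one.

    agent_ideas holds one set of canonical idea labels per agent, so an
    agent that repeats an idea is still counted once.
    """
    n = len(agent_ideas)
    counts = Counter(idea for ideas in agent_ideas for idea in ideas)
    tiers: dict[str, list[str]] = {"consensus": [], "divergent": [], "outlier": []}
    for idea, k in counts.most_common():
        # Integer comparisons avoid floating-point edge cases at the cutoffs.
        if k * 10 >= 7 * n:      # 7+ of 10 agents
            tiers["consensus"].append(idea)
        elif k * 10 >= 3 * n:    # 3-6 of 10 agents
            tiers["divergent"].append(idea)
        else:                    # 1-2 of 10 agents
            tiers["outlier"].append(idea)
    return tiers
```

The function only counts; it cannot tell a brilliant outlier from a hallucinated one, so everything in the `outlier` list still needs human review.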
I keep coming back to the outlier tier because it's where the real value hides. A consensus item ("improve onboarding flow") is probably something your team already knows. An outlier item ("partner with a complementary product for joint activation campaigns") might be the insight that changes your strategy. Or it might be a hallucination. The orchestrator can't tell you which. That's a feature, not a bug: it forces human evaluation on the ideas that actually need it.
Animated consensus pipeline