Design a multi-agent research system
Walk through designing a deep research agent like Perplexity Deep Research that orchestrates multiple specialized subagents to produce a comprehensive, cited research report.
TL;DR
- An orchestrator decomposes the user's research question into 5-10 sub-questions, then dispatches them to parallel subagents. End-to-end latency equals the slowest subagent (30-40s), not the sum (200s+).
- Each subagent has access to a tool registry (web search, RAG retrieval, calculator). Tools are stateless and MCP-compatible.
- Every claim in the final report must link to a source. Sourceless claims are rejected by the grounding checker, not guessed at. Source-grounded generation reduces hallucinated citations from 70% to under 5%.
- A human approval checkpoint lets the user review the research plan (sub-questions + sources found) before the orchestrator synthesizes the final report.
- Conflicting findings from different subagents are surfaced explicitly with evidence weights rather than silently resolved. This mirrors how academic meta-analyses handle contradictory evidence.
- Hierarchical budget allocation keeps cost predictable: $2.00 per job with real-time monitoring and adaptive redistribution across agents.
Requirements
Functional requirements
- The user submits a research question and receives a 2-3 page written report with inline citations.
- The system searches the web, internal knowledge bases, and structured data sources for relevant information.
- Every factual claim in the report links to a cited source. Claims without sources are flagged or dropped.
- The user can review and modify the research plan (sub-questions, sources) before the system generates the final report.
- Conflicting information from different sources is surfaced to the user rather than arbitrarily resolved.
Non-functional requirements
- End-to-end latency: 30-120 seconds for a comprehensive report (acceptable, not interactive).
- Each subagent has a hard 30-second timeout. Timed-out agents return partial findings rather than blocking the pipeline.
- The system runs up to 10 subagents in parallel.
- All retrieved source material is stored for 90 days for auditability.
- Cost per research job: under $2 at GPT-4o pricing (approximately 500K tokens total for 10 subagents).
- Citation accuracy: 95%+ of cited URLs must be alive and content-matched at report delivery time.
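The cost target above can be sanity-checked with simple arithmetic. The sketch below is only an estimate: the per-token prices and the 90/10 input/output split are illustrative assumptions, not figures from this article, so substitute current pricing before relying on the result.

```python
# Rough cost check for one research job.
# ASSUMPTIONS (not from the article): the per-million-token prices and the
# 90/10 input/output split below are illustrative placeholders.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens (assumed)


def estimate_job_cost(total_tokens: int, output_fraction: float = 0.10) -> float:
    """Return the estimated USD cost for a job that consumes total_tokens."""
    output_tokens = total_tokens * output_fraction
    input_tokens = total_tokens - output_tokens
    cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return round(cost, 4)


# 500K tokens is the article's approximate total for 10 subagents.
job_cost = estimate_job_cost(500_000)
```

Under these assumed prices the estimate stays below the $2 ceiling, but the margin shrinks quickly as the share of output tokens grows.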
The hardest engineering problem is not the LLM calls. It is preventing hallucinated citations. Over 70% of LLM-generated citations in ungrounded systems are fabricated or point to the wrong content. Source-grounded generation with post-hoc verification is the only reliable approach.
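A minimal sketch of what post-hoc verification could look like: a claim passes only if it cites at least one source and its verbatim excerpt actually appears in the stored text of one of those sources. The field names mirror the ReportClaim and Source entities in this article; the `Claim` class, `check_grounding`, and the whitespace normalization are assumptions made for illustration, not the article's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    claim_text: str
    source_ids: list[str] = field(default_factory=list)
    verbatim_excerpt: str = ""


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def check_grounding(claim: Claim, retrieved_text_by_source: dict[str, str]) -> bool:
    """Return True only if the claim's excerpt is found in a cited source's stored text."""
    if not claim.source_ids or not claim.verbatim_excerpt.strip():
        return False  # sourceless claims are rejected, never guessed at
    excerpt = _normalize(claim.verbatim_excerpt)
    for source_id in claim.source_ids:
        text = retrieved_text_by_source.get(source_id)
        if text is not None and excerpt in _normalize(text):
            return True
    return False


# Hypothetical stored source text, keyed by source_id.
stored = {"src_1": "Range sharding splits rows  by key range."}
grounded = check_grounding(Claim("claim A", ["src_1"], "splits rows by key range"), stored)
sourceless = check_grounding(Claim("claim B", [], "splits rows by key range"), stored)
```

Exact substring matching is deliberately strict. A production checker would probably need fuzzy or semantic matching as well, which this sketch does not attempt.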
The core entities
ResearchJob
job_id, user_id, original_question, status (decomposing | researching | awaiting_approval | synthesizing | complete), report_text, created_at, completed_at
SubQuestion
subq_id, job_id, question_text, assigned_agent_id, question_type (factual | comparative | statistical | opinion), status, findings_json, sources[], confidence_score, elapsed_ms, token_budget
Source
source_id, url, title, retrieved_text, retrieved_at, relevance_score, authority_score, subq_id
ReportClaim
claim_id, report_id, claim_text, source_ids[], grounding_score, verified, verbatim_excerpt
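The entities above could be modeled as typed records. The sketch below covers ResearchJob and SubQuestion with Python dataclasses; Source and ReportClaim would follow the same pattern. Only the field names and enumerated values come from the lists above. The field types, defaults, and the "pending" SubQuestion status are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    DECOMPOSING = "decomposing"
    RESEARCHING = "researching"
    AWAITING_APPROVAL = "awaiting_approval"
    SYNTHESIZING = "synthesizing"
    COMPLETE = "complete"


class QuestionType(str, Enum):
    FACTUAL = "factual"
    COMPARATIVE = "comparative"
    STATISTICAL = "statistical"
    OPINION = "opinion"


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ResearchJob:
    job_id: str
    user_id: str
    original_question: str
    status: JobStatus = JobStatus.DECOMPOSING
    report_text: Optional[str] = None
    created_at: datetime = field(default_factory=_utc_now)
    completed_at: Optional[datetime] = None


@dataclass
class SubQuestion:
    subq_id: str
    job_id: str
    question_text: str
    question_type: QuestionType
    assigned_agent_id: Optional[str] = None
    status: str = "pending"  # assumed; the article does not list SubQuestion status values
    findings_json: Optional[str] = None
    sources: list[str] = field(default_factory=list)  # source_id values
    confidence_score: Optional[float] = None
    elapsed_ms: Optional[int] = None
    token_budget: Optional[int] = None


job = ResearchJob(job_id="job_abc", user_id="user_1", original_question="example question")
subq = SubQuestion("subq_1", job.job_id, "example sub-question", QuestionType.FACTUAL)
```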
API design
POST /api/research -- submit a research question
Request: { "question": "What are the best practices for database sharding in 2025?", "depth": "comprehensive" }
Response: { "job_id": "job_abc", "status": "decomposing", "estimated_seconds": 60 }
GET /api/research/{job_id}/plan -- review research plan before synthesis
Response: {
  "sub_questions": [
    { "id": "subq_1", "text": "What are the main sharding strategies?", "type": "factual" },
    { "id": "subq_2", "text": "What do Postgres and MongoDB support natively?", "type": "comparative" }
  ],
  "sources_found": 34,
  "estimated_cost": "$1.20",
  "awaiting_approval": true
}
POST /api/research/{job_id}/approve -- approve or modify the plan
Request: { "approved": true, "remove_subquestion_ids": ["subq_3"], "add_questions": ["What are the failure modes?"] }
Response: { "job_id": "job_abc", "status": "synthesizing" }
GET /api/research/{job_id}/report -- retrieve the completed report
Response: {
  "report_text": "...",
  "citations": [{ "id": "src_1", "url": "...", "verified": true }],
  "conflicts": [{ "claim": "...", "sources": ["src_2", "src_5"], "resolution": "surfaced" }],
  "grounding_score": 0.94,
  "total_cost": "$1.45"
}
GET /api/research/{job_id}/stream -- SSE stream for real-time progress
data: { "phase": "researching", "agent": "subagent_3", "progress": "Found 8 relevant sources" }
data: { "phase": "synthesizing", "progress": "Writing section 2 of 5" }
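A client can consume the progress stream by reading `data:` lines and decoding each payload as JSON. The sketch below parses lines that have already been received rather than opening a network connection. `parse_sse_lines` is a hypothetical helper, and it assumes each event fits on a single `data:` line, as in the examples above (the SSE format also allows multi-line events, which this does not handle).

```python
import json
from typing import Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per single-line `data:` field."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


# The two sample events from the stream above, separated by blank lines.
sample = [
    'data: { "phase": "researching", "agent": "subagent_3", "progress": "Found 8 relevant sources" }',
    "",
    'data: { "phase": "synthesizing", "progress": "Writing section 2 of 5" }',
    "",
]
events = list(parse_sse_lines(sample))
```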
High-level design
The orchestrator receives the user question and calls a capable LLM (GPT-4o or Claude Sonnet) to decompose it into 5-10 sub-questions that collectively cover the topic. Decomposition is not trivial: a good orchestrator balances breadth (covering different angles) with specificity (sub-questions narrow enough for a single agent to answer well). Poor decomposition produces vague sub-questions that return generic web search results.
The orchestrator dispatches all sub-questions to the subagent pool simultaneously. Each subagent is an independent LLM process with access to the tool registry. Subagents run in parallel, with a 30-second timeout enforced by the orchestrator. A timed-out agent returns whatever partial findings it has accumulated. The orchestrator waits for all agents (or their timeouts) before proceeding.
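The dispatch-with-timeout behavior described above can be sketched with `asyncio`. This is an illustration under stated assumptions, not the article's implementation: `run_subagent` is a stand-in that sleeps instead of calling tools, and partial findings survive a timeout because each subagent appends to a list owned by the orchestrator. The timings are scaled down from 30 seconds so the example finishes quickly.

```python
import asyncio


async def run_subagent(question: str, findings: list[str], steps: int, step_seconds: float) -> None:
    """Stand-in for a real subagent: each step appends one finding to the shared list."""
    for i in range(steps):
        await asyncio.sleep(step_seconds)  # placeholder for a tool call
        findings.append(f"{question}: finding {i + 1}")


async def run_with_timeout(question: str, steps: int, step_seconds: float, timeout: float) -> dict:
    """Run one subagent; on timeout, return whatever it accumulated so far."""
    findings: list[str] = []
    timed_out = False
    try:
        await asyncio.wait_for(run_subagent(question, findings, steps, step_seconds), timeout)
    except asyncio.TimeoutError:
        timed_out = True  # the subagent task is cancelled, but its findings list is kept
    return {"question": question, "findings": findings, "timed_out": timed_out}


async def dispatch_all(plan: list[tuple[str, int]], step_seconds: float, timeout: float) -> list[dict]:
    """Start every subagent at once and wait for all of them (or their timeouts)."""
    return await asyncio.gather(
        *(run_with_timeout(question, steps, step_seconds, timeout) for question, steps in plan)
    )


# "fast" finishes well inside the timeout; "slow" needs about 1 s and is cut off at 0.3 s.
results = asyncio.run(dispatch_all([("fast", 2), ("slow", 50)], step_seconds=0.02, timeout=0.3))
```

Because the orchestrator owns the findings list, a cancelled subagent loses only the step that was in flight. Total wall-clock time is bounded by the timeout rather than by the sum of the agents' run times.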
Before synthesis, the system presents the research plan and sources found to the user as a human-in-the-loop checkpoint. This is the approval moment: the user can prune irrelevant sub-questions or add missing angles. After approval, the orchestrator deduplicates sources, identifies conflicting findings, and passes everything to a synthesis LLM that writes the final report with inline source citations.
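The deduplication and conflict-identification steps might look like the sketch below. It makes several assumptions: sources are deduplicated by a normalized URL, keeping the entry with the highest `relevance_score`, and findings arrive already labeled with hypothetical `topic` and `answer` keys. In a real system, deciding that two findings address the same topic would likely need an LLM or embedding comparison, and weighting by `authority_score` is left out here.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )


def dedupe_sources(sources: list[dict]) -> list[dict]:
    """Keep one source per normalized URL, preferring the highest relevance_score."""
    best: dict[str, dict] = {}
    for source in sources:
        key = normalize_url(source["url"])
        if key not in best or source["relevance_score"] > best[key]["relevance_score"]:
            best[key] = source
    return list(best.values())


def find_conflicts(findings: list[dict]) -> list[dict]:
    """Report topics where sources give different answers, listing source ids per answer."""
    answers_by_topic: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for finding in findings:
        answers_by_topic[finding["topic"]][finding["answer"]].append(finding["source_id"])
    return [
        {"topic": topic, "positions": dict(answers), "resolution": "surfaced"}
        for topic, answers in answers_by_topic.items()
        if len(answers) > 1
    ]


# Placeholder data for illustration only.
unique = dedupe_sources([
    {"url": "https://Example.com/sharding/", "relevance_score": 0.6},
    {"url": "https://example.com/sharding#intro", "relevance_score": 0.9},
    {"url": "https://example.com/other", "relevance_score": 0.5},
])
conflicts = find_conflicts([
    {"topic": "topic_a", "answer": "yes", "source_id": "src_2"},
    {"topic": "topic_a", "answer": "no", "source_id": "src_5"},
    {"topic": "topic_b", "answer": "yes", "source_id": "src_1"},
])
```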
The data flows through three distinct phases. Phase 1 (decomposition) takes 2-5 seconds: the orchestrator LLM generates sub-questions, classifies them by type, and estimates token budgets. Phase 2 (research) takes 20-30 seconds: all subagents run in parallel, each making 2-3 tool calls (search, fetch, summarize). Phase 3 (synthesis) takes 10-20 seconds of compute time, not counting the human review that precedes it: deduplication, conflict detection, report writing, and grounding verification.
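The token budgets estimated in Phase 1 can be treated as a split of one job-level budget, with unused tokens handed back as agents finish. The sketch below is one possible scheme, not the article's: the weights per sub-question and the even redistribution rule are assumptions.

```python
def allocate_budgets(total_tokens: int, weights: dict[str, float]) -> dict[str, int]:
    """Split the job budget across sub-questions in proportion to their weights."""
    total_weight = sum(weights.values())
    return {subq: int(total_tokens * w / total_weight) for subq, w in weights.items()}


def redistribute(budgets: dict[str, int], used: dict[str, int], finished: set[str]) -> dict[str, int]:
    """Return new budgets after moving finished agents' unused tokens to running agents."""
    running = [subq for subq in budgets if subq not in finished]
    leftover = sum(budgets[subq] - used[subq] for subq in finished)
    # A finished agent's budget shrinks to what it actually spent.
    updated = {subq: (used[subq] if subq in finished else budgets[subq]) for subq in budgets}
    if running:
        share = leftover // len(running)  # integer split; any remainder stays unallocated
        for subq in running:
            updated[subq] += share
    return updated


# Assumed weights: subq_3 is expected to need twice the tokens of the others.
budgets = allocate_budgets(500_000, {"subq_1": 1, "subq_2": 1, "subq_3": 2})
# subq_1 finishes after spending 75K of its 125K, freeing 50K for the other two.
updated = redistribute(budgets, used={"subq_1": 75_000}, finished={"subq_1"})
```

The job-level total never grows under this scheme, which is what keeps the per-job cost predictable.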
The human checkpoint between phases 2 and 3 is the critical design decision. Without it, the system produces reports the user did not ask for. With it, the user sees the sub-questions and source count, can remove irrelevant angles, and add missing ones. This increases user trust and report relevance at the cost of 30-60 seconds of human review time. For automated research workflows (no human in the loop), this checkpoint becomes an optional quality gate controlled by a confidence threshold.
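A sketch of the checkpoint logic, under stated assumptions: `apply_plan_edits` applies the fields of the approval request shown in the API section, and `needs_human_review` is a hypothetical gate for automated workflows. The 0.8 threshold and the choice to gate on the lowest confidence score are illustrative, not values from the article.

```python
def apply_plan_edits(sub_questions: list[dict], request: dict) -> list[dict]:
    """Drop removed sub-questions and append user-added ones as new, unresearched entries."""
    removed = set(request.get("remove_subquestion_ids", []))
    plan = [q for q in sub_questions if q["id"] not in removed]
    for text in request.get("add_questions", []):
        # Assumption: added questions get an id later and still need a research pass.
        plan.append({"id": None, "text": text, "status": "pending"})
    return plan


def needs_human_review(confidence_scores: list[float], threshold: float = 0.8) -> bool:
    """Skip the checkpoint only if every sub-question clears the confidence threshold."""
    return not confidence_scores or min(confidence_scores) < threshold


plan = apply_plan_edits(
    [{"id": "subq_1", "text": "first"}, {"id": "subq_3", "text": "third"}],
    {"approved": True, "remove_subquestion_ids": ["subq_3"], "add_questions": ["What are the failure modes?"]},
)
```

Gating on the minimum rather than the mean means a single low-confidence sub-question is enough to bring a human back into the loop.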
The AI-specific challenges
Parallel execution: latency bound by the slowest agent
Ten subagents running sequentially take roughly 10x the time of one. Running them in parallel means end-to-end latency is determined by the slowest subagent, not the sum. In practice, one or two agents usually hit their 30-second timeout, so total latency clusters around 30-40 seconds even with 10 agents.
The 30-second timeout is calibrated for web search round-trips: a web search call returns in 1-3 seconds, document retrieval in 3-5 seconds, and an LLM synthesis call for a single sub-question in 5-10 seconds. Two or three tool calls per agent is achievable in 30 seconds. An agent that tries to search 10 URLs will time out. Good agent architecture keeps subagents narrowly focused.
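The timeout arithmetic can be made explicit. The sketch below uses the latency ranges quoted above and assumes, for illustration, that an agent works sequentially in rounds of one search plus one document retrieval, followed by a single LLM synthesis call.

```python
# Latency ranges in seconds, taken from the paragraph above.
SEARCH_S = (1, 3)
FETCH_S = (3, 5)
SYNTH_S = (5, 10)
TIMEOUT_S = 30


def agent_time_range(rounds: int) -> tuple[int, int]:
    """Best and worst case seconds for `rounds` search+fetch rounds plus one synthesis call."""
    best = rounds * (SEARCH_S[0] + FETCH_S[0]) + SYNTH_S[0]
    worst = rounds * (SEARCH_S[1] + FETCH_S[1]) + SYNTH_S[1]
    return best, worst


two_rounds = agent_time_range(2)    # (13, 26): fits inside the timeout even in the worst case
three_rounds = agent_time_range(3)  # (17, 34): fits unless every call is slow
ten_rounds = agent_time_range(10)   # (45, 90): exceeds the timeout even in the best case
```

Under this model, ten URLs cannot finish within 30 seconds even if every call returns at its fastest, which is why narrowly scoped sub-questions matter more than a longer timeout.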
Citation tracking through synthesis