Output verification loop
Learn how claim-level output verification catches hallucinations before agents act on them, by grounding each claim against evidence sources and returning per-claim trust scores.
TL;DR
- LLMs hallucinate. For agentic tasks where outputs drive downstream actions, a hallucinated fact can cause real harm: filing incorrect information, generating wrong code, or sending misleading communications.
- Output verification extracts individual claims from the LLM's response and checks each claim against a grounding source before the output is accepted or acted upon.
- A "claim" here is any atomic fact the output asserts: a number, a name, a date, a causal relationship. Claims that can't be grounded get flagged, not silently passed through.
- The verification step can be deterministic (database lookup), retrieval-based (RAG search), or LLM-based (secondary model checks against the original sources). Use the cheapest method that's reliable enough.
- This is different from reflection. Reflection improves reasoning quality. Output verification checks factual accuracy against external evidence. You often want both.
The problem it solves
Your research agent summarizes a company's financial performance from 10-K filings. The output says: "Revenue grew 23% year-over-year, reaching $4.7 billion in 2024." The actual 10-K shows revenue of $4.2 billion with 19% growth. The error looks small, but neither figure matches the source: the model appears to have drawn on figures misremembered from training data that conflicted with the provided documents.
An analyst acts on the summary. The hallucinated number makes it into a report.
Reflection loops won't catch this. The model doesn't "know" its numbers were wrong. It generated them with high confidence. The reflection step will likely approve them. What you need is to check the claimed number against the source document it was derived from.
Output verification is the grounding pass that reflection can't provide.
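As a rough illustration of that grounding pass, the sketch below compares the figures in the example claim against a source sentence. The source string is a made-up stand-in for the 10-K text, and `NUMBER_PATTERN`, `extract_figures`, and `ungrounded_figures` are hypothetical names for this example; a real extractor would need to handle many more number formats.

```python
import re

# Matches figures such as "23%", "$4.7 billion", or "2024". Illustrative only.
NUMBER_PATTERN = re.compile(r"\$?\d+(?:\.\d+)?(?:%| (?:billion|million))?")


def extract_figures(text: str) -> set[str]:
    """Return the set of numeric figures mentioned in the text."""
    return set(NUMBER_PATTERN.findall(text))


def ungrounded_figures(claim: str, source: str) -> set[str]:
    """Figures asserted in the claim that never appear in the source."""
    return extract_figures(claim) - extract_figures(source)


claim = "Revenue grew 23% year-over-year, reaching $4.7 billion in 2024."
# Assumed wording; stands in for the relevant passage of the filing.
source = "Fiscal 2024 revenue was $4.2 billion, an increase of 19% over fiscal 2023."

print(sorted(ungrounded_figures(claim, source)))  # ['$4.7 billion', '23%']
```

Both hallucinated figures surface because they have no counterpart in the source, while "2024" passes because the source mentions it. Exact string matching is deliberately strict here; it would also flag a correctly rounded figure, which is one reason to pair it with the other verification methods.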
What is it?
Output verification is a post-generation pipeline that:
1. Extracts individual verifiable claims from the LLM's output.
2. For each claim, retrieves the evidence that should support it.
3. Asks a verifier (deterministic rule, retrieval search, or secondary LLM) whether the claim matches the evidence.
4. Returns a per-claim trust score and flags claims that fail verification.
The final output can be: accepted as-is (all claims pass), returned with flags (some claims uncertain), or rejected and regenerated (critical claims failed). The action taken depends on the use case and trust threshold.
How it works
The verification pipeline
Claim extraction
Use a separate LLM call to decompose the output into a list of atomic, verifiable claims. The extraction prompt is straightforward:
```text
Extract all verifiable factual claims from the following text as a list.
A claim is any statement asserting a specific fact: a number, name, date,
percentage, causal relationship, or event.
Text: {output}
Return as JSON array of strings. Each claim should be a single sentence.
```
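One way to wire this prompt into code is sketched below. The `complete` parameter is an assumption: it stands for whatever function sends a prompt to your model and returns its text reply, and `fake_complete` is a hypothetical stub so the example runs without a model. This version returns plain strings, matching the JSON format the prompt asks for.

```python
import json
from typing import Callable

EXTRACTION_PROMPT = """Extract all verifiable factual claims from the following text as a list.
A claim is any statement asserting a specific fact: a number, name, date,
percentage, causal relationship, or event.
Text: {output}
Return as JSON array of strings. Each claim should be a single sentence."""


def extract_claims(output: str, complete: Callable[[str], str]) -> list[str]:
    """Ask the model for claims and validate that the reply is a JSON array of strings."""
    # Use replace() rather than str.format() so braces in the output text are harmless.
    prompt = EXTRACTION_PROMPT.replace("{output}", output)
    raw = complete(prompt)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"extractor did not return valid JSON: {raw!r}") from exc
    if not isinstance(parsed, list) or not all(isinstance(c, str) for c in parsed):
        raise ValueError("extractor must return a JSON array of strings")
    # Strip whitespace and drop empty entries.
    return [c.strip() for c in parsed if c.strip()]


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return '["Revenue grew 23% year-over-year.", "Revenue reached $4.7 billion in 2024."]'


claims = extract_claims(
    "Revenue grew 23% year-over-year, reaching $4.7 billion in 2024.", fake_complete
)
print(claims)
```

Validating the reply before using it matters because the extractor is itself an LLM call: a malformed reply should fail loudly rather than produce an empty claim list that looks like "nothing to verify."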
Verification methods (in order of cost)
1. Deterministic lookup (cheapest): for claims that reference structured data (numbers, names, dates), query the database directly. If the output claims "order #12345 was $299.99," look up order #12345 in the database. Boolean pass/fail, near-zero latency.
2. Retrieval-based verification: for claims derived from unstructured sources (documents, web pages), retrieve the most relevant passages and check whether the claim is supported. Uses the same embedding/search infrastructure as RAG.
3. LLM-based verification: for complex claims requiring reasoning, a secondary LLM call receives the claim and the retrieved evidence and returns a verdict with a confidence score. Most expensive but handles nuanced claims.
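The deterministic lookup in item 1 might look like the following sketch, which uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its `total_cents` column, and the `verify_order_total` function are assumptions made up for this example, not a schema the article prescribes.

```python
import sqlite3
from decimal import Decimal

# In-memory database standing in for the system of record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL)")
conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)", (12345, 29999))


def verify_order_total(conn: sqlite3.Connection, order_id: int, claimed_total: str) -> bool:
    """Pass only if the order exists and its stored total equals the claimed total."""
    row = conn.execute(
        "SELECT total_cents FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return False  # An order that does not exist cannot ground the claim.
    # Compare in cents using Decimal to avoid floating-point equality problems.
    return Decimal(claimed_total) * 100 == row[0]


print(verify_order_total(conn, 12345, "299.99"))  # True
print(verify_order_total(conn, 12345, "249.99"))  # False
```

A missing record is treated as a failure rather than "unknown," which keeps the check consistent with flagging anything that cannot be grounded.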
Use the cheapest method that works for each claim type. Many production systems use deterministic lookup for claims about structured data and reserve LLM verification for claims that cannot be checked against a database.
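One possible way to express "cheapest method first" is a cascade in which each verifier either decides the claim or defers to the next, more expensive one. Everything in this sketch is hypothetical: the `Verdict` shape, the toy `KNOWN_ORDER_TOTALS` dictionary standing in for a database, and `llm_verifier`, which returns a fixed verdict in place of a real model call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    supported: bool
    confidence: float
    method: str


# A verifier returns a Verdict, or None when it cannot decide this claim.
Verifier = Callable[[str], Optional[Verdict]]


def verify_with_cascade(claim: str, verifiers: list[Verifier]) -> Verdict:
    """Try verifiers from cheapest to most expensive; stop at the first that decides."""
    for verifier in verifiers:
        verdict = verifier(claim)
        if verdict is not None:
            return verdict
    # Nothing could ground the claim: flag it rather than pass it silently.
    return Verdict(supported=False, confidence=0.0, method="none")


KNOWN_ORDER_TOTALS = {"12345": "299.99"}  # Toy stand-in for a database.


def db_verifier(claim: str) -> Optional[Verdict]:
    match = re.search(r"order #(\d+) was \$(\d+\.\d{2})", claim)
    if match is None:
        return None  # Not a structured order claim; defer to the next verifier.
    order_id, total = match.groups()
    return Verdict(KNOWN_ORDER_TOTALS.get(order_id) == total, 1.0, "db_lookup")


def llm_verifier(claim: str) -> Optional[Verdict]:
    # Placeholder for a secondary model call; returns a fixed verdict here.
    return Verdict(supported=True, confidence=0.7, method="llm")


cascade = [db_verifier, llm_verifier]
print(verify_with_cascade("order #12345 was $299.99", cascade).method)
print(verify_with_cascade("The outage was caused by a bad deploy.", cascade).method)
```

The order of the list encodes the cost ranking, so adding a retrieval-based verifier between the two is a one-line change.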
Trust thresholds and actions
Not all claims warrant the same response to failure:
| Claim type | Failure action |
|---|---|
| Critical numbers (financial, medical) | Block output, regenerate |
| Supporting claims (contextual background) | Flag with correction |
| Stylistic assertions ("commonly known as") | Accept with low-confidence annotation |
| Unverifiable opinions | Accept, mark as unverified |
Document your trust thresholds. "Critical claims must pass at 0.95 confidence" is a product decision, not a technical one.
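Writing the table down as data is one way to keep those thresholds documented and reviewable. In the sketch below, the claim-type keys, the `Policy` and `FailureAction` names, and every threshold other than the 0.95 example are placeholders chosen for illustration; a `confidence` of `None` is assumed to mean that no verification was possible.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureAction(Enum):
    BLOCK_AND_REGENERATE = "block_and_regenerate"
    FLAG_WITH_CORRECTION = "flag_with_correction"
    ACCEPT_LOW_CONFIDENCE = "accept_low_confidence"
    ACCEPT_UNVERIFIED = "accept_unverified"


@dataclass(frozen=True)
class Policy:
    min_confidence: float
    on_failure: FailureAction


# Threshold values are placeholders, except 0.95, which echoes the example above.
POLICIES = {
    "critical_number": Policy(0.95, FailureAction.BLOCK_AND_REGENERATE),
    "supporting": Policy(0.80, FailureAction.FLAG_WITH_CORRECTION),
    "stylistic": Policy(0.50, FailureAction.ACCEPT_LOW_CONFIDENCE),
    "opinion": Policy(0.0, FailureAction.ACCEPT_UNVERIFIED),
}


def action_for(claim_type: str, confidence: Optional[float]) -> Optional[FailureAction]:
    """Return the failure action for a claim, or None if it meets its threshold."""
    policy = POLICIES[claim_type]
    if confidence is not None and confidence >= policy.min_confidence:
        return None
    return policy.on_failure


print(action_for("critical_number", 0.90))  # FailureAction.BLOCK_AND_REGENERATE
print(action_for("critical_number", 0.97))  # None
print(action_for("opinion", None))          # FailureAction.ACCEPT_UNVERIFIED
```

Keeping the policy in one table means a change to a threshold is a reviewed data change rather than an edit buried in control flow.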
When to use it
Use output verification when:
- Your agent's outputs include factual claims derived from specific source documents (RAG outputs, research summaries, financial reports).
- Acting on a wrong claim has real consequences (downstream decisions, customer-facing communications, automated actions).
- You have a ground-truth source to verify against (either a database or the documents used to generate the output).
Skip it when:
- Outputs are purely generative and have no verifiable factual component (creative writing, code generation where you'll run tests anyway).
- The use case tolerates hallucination at the application layer (brainstorming, drafting where a human reviews the output regardless).
- The cost of verification exceeds the cost of occasional hallucination errors (this needs careful analysis, and the assumption is often wrong).
Implementation sketch
This is a simplified implementation showing the core mechanism. Production systems add retry logic, caching for repeated claims, and batch verification.
```python
def verify_output(output: str, sources: list[Document]) -> VerifiedOutput:
    # Step 1: Extract atomic claims from the LLM output
    claims = extract_claims(output)  # Separate LLM call

    results = []
    for claim in claims:
        # Step 2: Choose cheapest verification method per claim type
        if claim.has_structured_data():
            verdict = db_lookup(claim)  # Deterministic, ~1ms
        elif claim.references_document():
            evidence = retrieve_passages(claim, sources)
            verdict = check_support(claim, evidence)  # Embedding search
        else:
            verdict = llm_verify(claim, sources)  # Secondary LLM
        results.append(ClaimResult(claim=claim, verdict=verdict))

    # Step 3: Apply trust thresholds and decide action
    critical_failures = [r for r in results if r.is_critical and r.verdict.failed]
    if critical_failures:
        return VerifiedOutput(action="regenerate", failures=critical_failures)

    flagged = [r for r in results if r.verdict.confidence < THRESHOLD]
    if flagged:
        return VerifiedOutput(action="flag", output=output, flags=flagged)

    return VerifiedOutput(action="accept", output=output)
```
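Of the production additions mentioned above, caching for repeated claims is the simplest to sketch. The wrapper below is a minimal illustration under stated assumptions: `CachedVerifier`, `normalize`, and `slow_verify` are hypothetical names, the verifier is reduced to a function from claim text to a boolean, and the cache is assumed to be scoped to a single set of sources.

```python
from typing import Callable


def normalize(claim: str) -> str:
    """Collapse case and whitespace so trivially different phrasings share a cache entry."""
    return " ".join(claim.lower().split())


class CachedVerifier:
    """Wraps a verifier function and remembers verdicts for claims already checked."""

    def __init__(self, verify: Callable[[str], bool]) -> None:
        self._verify = verify
        self._cache: dict[str, bool] = {}
        self.calls = 0  # Number of times the wrapped verifier actually ran.

    def __call__(self, claim: str) -> bool:
        key = normalize(claim)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._verify(claim)
        return self._cache[key]


def slow_verify(claim: str) -> bool:
    """Stand-in for an expensive check; here it only looks for a known-good figure."""
    return "$4.2 billion" in claim


verifier = CachedVerifier(slow_verify)
verdicts = [
    verifier(c)
    for c in [
        "Revenue was $4.2 billion.",
        "revenue was  $4.2 billion.",  # Same claim after normalization.
        "Revenue was $4.7 billion.",
    ]
]
print(verdicts, verifier.calls)  # [True, True, False] 2
```

A cached verdict is only valid for the sources it was checked against, so a real cache would need to be keyed by, or cleared on, changes to the underlying documents or database.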