Output verification loop

TL;DR

LLMs hallucinate. For agentic tasks where outputs drive downstream actions, a hallucinated fact can cause real harm: filing incorrect information, generating wrong code, or sending misleading communications.
Output verification extracts individual claims from the LLM's response and checks each claim against a grounding source before the output is accepted or acted upon.
A "claim" here is any atomic fact the output asserts: a number, a name, a date, a causal relationship. Claims that can't be grounded get flagged, not silently passed through.
The verification step can be deterministic (database lookup), retrieval-based (RAG search), or LLM-based (secondary model checks against the original sources). Use the cheapest method that's reliable enough.
This is different from reflection. Reflection improves reasoning quality. Output verification checks factual accuracy against external evidence. You often want both.

The problem it solves

Your research agent summarizes a company's financial performance from 10-K filings. The output says: "Revenue grew 23% year-over-year, reaching $4.7 billion in 2024." The actual 10-K shows revenue of $4.2 billion with 19% growth. The hallucination is subtle: plausible figures, apparently recalled from training data, that conflict with the provided documents.

An analyst acts on the summary. The hallucinated number makes it into a report.

Reflection loops won't catch this. The model doesn't "know" its numbers were wrong. It generated them with high confidence. The reflection step will likely approve them. What you need is to check the claimed number against the source document it was derived from.

Output verification is the grounding pass that reflection can't provide.

What is it?

Output verification is a post-generation pipeline that:

1. Extracts individual verifiable claims from the LLM's output.
2. For each claim, retrieves the evidence that should support it.
3. Asks a verifier (deterministic rule, retrieval search, or secondary LLM) whether the claim matches the evidence.
4. Returns a per-claim trust score and flags claims that fail verification.

The final output can be: accepted as-is (all claims pass), returned with flags (some claims uncertain), or rejected and regenerated (critical claims failed). The action taken depends on the use case and trust threshold.

How it works

The verification pipeline

Claim extraction

Use a separate LLM call to decompose the output into a list of atomic, verifiable claims. The extraction prompt is straightforward:

Extract all verifiable factual claims from the following text as a list.
A claim is any statement asserting a specific fact: a number, name, date,
percentage, causal relationship, or event.

Text: {output}

Return as JSON array of strings. Each claim should be a single sentence.
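
A minimal sketch of how this prompt might be wired up. The `call_llm` parameter is a hypothetical callable (prompt in, reply text out) standing in for whatever client you use, and the bracket-slicing step is an assumption about how replies tend to be wrapped, not a guarantee:

```python
import json
from typing import Callable

EXTRACTION_PROMPT = """Extract all verifiable factual claims from the following text as a list.
A claim is any statement asserting a specific fact: a number, name, date,
percentage, causal relationship, or event.

Text: {output}

Return as JSON array of strings. Each claim should be a single sentence."""


def extract_claims(output: str, call_llm: Callable[[str], str]) -> list[str]:
    """Ask the model for claims and parse its reply defensively."""
    reply = call_llm(EXTRACTION_PROMPT.format(output=output))
    # Replies may wrap the JSON in extra prose, so keep only the
    # outermost [...] span before parsing.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    claims = json.loads(reply[start : end + 1])
    # Drop non-string items and empty strings rather than verifying noise.
    return [c.strip() for c in claims if isinstance(c, str) and c.strip()]
```

Raising on an unparseable reply, instead of returning an empty list, keeps a failed extraction from looking like an output with nothing to verify.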

Verification methods (in order of cost)

1. Deterministic lookup (cheapest): for claims that reference structured data (numbers, names, dates), query the database directly. If the output claims "order #12345 was $299.99," look up order #12345 in the database. Boolean pass/fail, near-zero latency.

2. Retrieval-based verification: for claims derived from unstructured sources (documents, web pages), retrieve the most relevant passages and check whether the claim is supported. Uses the same embedding/search infrastructure as RAG.

3. LLM-based verification: for complex claims requiring reasoning, a secondary LLM call receives the claim and the retrieved evidence and returns a verdict with a confidence score. Most expensive but handles nuanced claims.
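
As an illustration of the first method, here is a small sketch of a deterministic lookup for the order example, using an in-memory SQLite database. The `orders` table, its `total_cents` column, and the `verify_order_total` helper are assumptions made for the example, not a prescribed schema:

```python
import sqlite3
from decimal import Decimal


def verify_order_total(conn: sqlite3.Connection, order_id: int, claimed_total: str) -> bool:
    """Return True only if the order exists and its stored total equals the claim."""
    row = conn.execute(
        "SELECT total_cents FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return False  # a claim that cannot be grounded fails; it is not passed through
    # Compare in integer cents to avoid floating-point rounding issues.
    claimed_cents = int(Decimal(claimed_total.replace("$", "").replace(",", "")) * 100)
    return row[0] == claimed_cents


# Hypothetical data for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")
conn.execute("INSERT INTO orders VALUES (12345, 29999)")

print(verify_order_total(conn, 12345, "$299.99"))  # True
print(verify_order_total(conn, 12345, "$309.99"))  # False
```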

Use the cheapest method that works for each claim type. Many production systems use deterministic lookup for factual claims and skip LLM verification for anything that can be database-checked.
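
One possible reading of "cheapest method that works" is a cascade: try verifiers in cost order and escalate only when a cheaper one cannot judge the claim. The sketch below assumes each verifier returns `None` when it has no opinion; `Verdict`, `db_check`, and `llm_check` are hypothetical stand-ins, and the two check functions are stubs rather than real lookups or model calls:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    supported: bool
    confidence: float
    method: str


# A verifier returns a Verdict, or None when it cannot judge the claim.
Verifier = Callable[[str], Optional[Verdict]]


def verify_cheapest_first(claim: str, verifiers: list[Verifier]) -> Verdict:
    """Try verifiers in cost order and stop at the first one that can decide."""
    for verifier in verifiers:
        verdict = verifier(claim)
        if verdict is not None:
            return verdict
    # Nothing could ground the claim: report it as unsupported so it gets flagged.
    return Verdict(supported=False, confidence=0.0, method="none")


def db_check(claim: str) -> Optional[Verdict]:
    # Stub for a database lookup; it only knows about order #12345.
    if "order #12345" not in claim:
        return None
    return Verdict("$299.99" in claim, 1.0, "db")


def llm_check(claim: str) -> Optional[Verdict]:
    # Stub for a secondary LLM call; it always answers, at higher cost.
    return Verdict(True, 0.7, "llm")


verifiers = [db_check, llm_check]  # ordered cheapest first
print(verify_cheapest_first("order #12345 was $299.99", verifiers).method)  # db
print(verify_cheapest_first("The outage caused the delay", verifiers).method)  # llm
```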

Trust thresholds and actions

Not all claims warrant the same response to failure:

Claim type	Failure action
Critical numbers (financial, medical)	Block output, regenerate
Supporting claims (contextual background)	Flag with correction
Stylistic assertions ("commonly known as")	Accept with low-confidence annotation
Unverifiable opinions	Accept, mark as unverified

Document your trust thresholds. "Critical claims must pass at 0.95 confidence" is a product decision, not a technical one.
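
One way to keep those thresholds documented is to hold them in a single reviewable table in code. In this sketch only the 0.95 figure comes from the text above; the other thresholds, the claim-type keys, and the `Action` and `Policy` names are placeholders to be replaced by your own product decisions:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    MARK_UNVERIFIED = "mark_unverified"
    ANNOTATE = "annotate_low_confidence"
    FLAG = "flag_with_correction"
    REGENERATE = "block_and_regenerate"


@dataclass(frozen=True)
class Policy:
    min_confidence: float  # confidence a supported claim needs in order to pass
    on_failure: Action     # what to do when the claim does not pass


POLICIES = {
    "critical": Policy(0.95, Action.REGENERATE),
    "supporting": Policy(0.80, Action.FLAG),          # placeholder threshold
    "stylistic": Policy(0.50, Action.ANNOTATE),       # placeholder threshold
    "opinion": Policy(0.00, Action.MARK_UNVERIFIED),  # placeholder threshold
}

# Least to most severe; the overall output takes the most severe per-claim action.
SEVERITY = [Action.ACCEPT, Action.MARK_UNVERIFIED, Action.ANNOTATE,
            Action.FLAG, Action.REGENERATE]


def action_for(claim_type: str, supported: bool, confidence: float) -> Action:
    policy = POLICIES[claim_type]
    if supported and confidence >= policy.min_confidence:
        return Action.ACCEPT
    return policy.on_failure


def overall_action(actions: list[Action]) -> Action:
    return max(actions, key=SEVERITY.index, default=Action.ACCEPT)


print(action_for("critical", True, 0.90))  # Action.REGENERATE
```

Keeping the policy as data rather than scattered `if` statements means a product owner can review and change it without reading the verification code.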

When to use it

Use output verification when:

Your agent's outputs include factual claims derived from specific source documents (RAG outputs, research summaries, financial reports).
Acting on a wrong claim has real consequences (downstream decisions, customer-facing communications, automated actions).
You have a ground-truth source to verify against (either a database or the documents used to generate the output).

Skip it when:

Outputs are purely generative and have no verifiable factual component (creative writing, code generation where you'll run tests anyway).
The use case tolerates hallucination at the application layer (brainstorming, drafting where a human reviews the output regardless).
The cost of verification exceeds the cost of occasional hallucination errors. Reaching this conclusion requires careful analysis, and the assumption is often wrong.

Implementation sketch

This is a simplified implementation showing the core mechanism. Production systems add retry logic, caching for repeated claims, and batch verification.

def verify_output(output: str, sources: list[Document]) -> VerifiedOutput:
    # Document, ClaimResult, VerifiedOutput, THRESHOLD, and the helper
    # functions called below are assumed to be defined elsewhere.

    # Step 1: Extract atomic claims from the LLM output
    claims = extract_claims(output)  # Separate LLM call

    results = []
    for claim in claims:
        # Step 2: Choose cheapest verification method per claim type
        if claim.has_structured_data():
            verdict = db_lookup(claim)                    # Deterministic, ~1ms
        elif claim.references_document():
            evidence = retrieve_passages(claim, sources)  # Embedding search
            verdict = check_support(claim, evidence)      # Compare claim to passages
        else:
            verdict = llm_verify(claim, sources)          # Secondary LLM

        results.append(ClaimResult(claim=claim, verdict=verdict))

    # Step 3: Apply trust thresholds and decide action
    critical_failures = [r for r in results if r.is_critical and r.verdict.failed]
    if critical_failures:
        return VerifiedOutput(action="regenerate", failures=critical_failures)

    # Flag non-critical failures as well as low-confidence verdicts,
    # so a failed claim is never silently accepted.
    flagged = [r for r in results
               if r.verdict.failed or r.verdict.confidence < THRESHOLD]
    if flagged:
        return VerifiedOutput(action="flag", output=output, flags=flagged)

    return VerifiedOutput(action="accept", output=output)
