Reflection loop

TL;DR

The reflection loop is a three-step cycle: generate, critique, refine. The model produces an output, then evaluates it against explicit criteria, then revises based on that evaluation.
A dedicated critique prompt consistently outperforms "think step by step" prompting alone. Separating generation from evaluation forces the model into a different reasoning mode.
The critique step should target concrete dimensions: correctness, completeness, tone, constraint satisfaction. Vague critique ("is this good?") produces vague improvements.
Reflection loops have diminishing returns. Two or three rounds usually capture 80% of the quality gain. Beyond that, outputs plateau or drift.
Use a termination condition, not a fixed loop count. Stop when the critic reports that the output satisfies all criteria, and add a max-iteration fallback in case it never does.

The problem it solves

Your agent generates a SQL query to answer a business question. The first attempt looks plausible but misses a GROUP BY clause and would return wrong numbers. Without a feedback loop, the agent submits it and the downstream system either crashes or silently returns incorrect data.
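To make the failure concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its rows, and the function name are invented for illustration. SQLite accepts an aggregate query that has a bare column and no GROUP BY, so it shows the silent wrong-data case; stricter engines such as PostgreSQL reject the same query with an error instead.

```python
import sqlite3


def revenue_by_region(query: str) -> list[tuple]:
    """Run a query against a tiny in-memory orders table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 100), ("east", 50), ("west", 200)],
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


# First draft: aggregation without GROUP BY. SQLite runs it and returns a
# single row holding the grand total next to one arbitrary region.
draft = revenue_by_region("SELECT region, SUM(amount) FROM orders")

# Refined query: one row per region.
fixed = revenue_by_region(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
)
```

The draft returns one row instead of two and raises no error, which is exactly the kind of output a critique step is meant to catch before it leaves the system.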

The standard fix is "improve the prompt." But prompts can't anticipate every failure mode. Model outputs are probabilistic, and even a well-prompted model produces suboptimal first drafts a significant fraction of the time, especially on complex tasks requiring multiple constraints to be satisfied simultaneously.

The reflection loop treats LLM output as a draft, not a final answer. Just as experienced engineers review their own code before committing, reflection forces the model to evaluate its own work before the output leaves the system.

What is it?

The reflection loop is an iterative self-improvement pattern where an agent generates an output and then uses a separate evaluation pass to critique and refine it. The critique and refinement run inside the same system (often the same model with a different prompt) rather than requiring an external reviewer.

The pattern was formalized in the Reflexion paper (2023), which showed that language agents could use verbal feedback from their own failures to improve performance on coding, decision-making, and reasoning tasks, all without updating model weights.

How it works

The three-phase cycle

Phase 1: Generate. The main prompt produces an initial output. This is a standard LLM call with the task description and any relevant context.

Phase 2: Critique. A separate critique prompt evaluates the output. The critique prompt receives the original task, the generated output, and a rubric for evaluation. It returns specific, actionable feedback for each criterion, not just a bare pass/fail verdict.

Phase 3: Refine. The refinement prompt receives the original task, the initial output, and the critique. It produces an improved version. This loops back to Phase 2 until the termination condition is met.

Writing an effective critique prompt

The critique prompt is the most important part of the pattern. A weak critique ("Is this answer correct?") produces weak improvements. A strong critique targets specific, evaluatable dimensions:

You are evaluating a SQL query generated to answer this question: {question}

Query to evaluate:
{query}

Check each criterion and respond with PASS or FAIL + one-sentence explanation:
1. Does the query return the correct columns for the question?
2. Are all necessary JOINs present?
3. Are GROUP BY clauses correct when aggregation functions are used?
4. Would this query run without syntax errors?

End your response with: VERDICT: PASS if all criteria pass, FAIL otherwise.
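A critique written in this format still has to be turned into something the loop can act on. The sketch below is one possible parser, not a fixed part of the pattern. It assumes the model replies with lines shaped like "3. FAIL + explanation" followed by a final "VERDICT: PASS" or "VERDICT: FAIL" line; the Critique class and parse_critique function are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Critique:
    verdict: str  # "PASS" or "FAIL"
    issues: list[str] = field(default_factory=list)  # explanations for failed criteria


# Assumed reply shapes: "3. FAIL + explanation" and a final "VERDICT: PASS|FAIL".
CRITERION_RE = re.compile(r"^\s*(\d+)[.)]\s*(PASS|FAIL)\b[\s:+\-]*(.*)$", re.IGNORECASE)
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL)\b", re.IGNORECASE)


def parse_critique(response: str) -> Critique:
    """Extract failed criteria and the overall verdict from a critique reply."""
    issues = []
    for line in response.splitlines():
        match = CRITERION_RE.match(line)
        if match and match.group(2).upper() == "FAIL":
            issues.append(f"Criterion {match.group(1)}: {match.group(3).strip()}")
    verdicts = VERDICT_RE.findall(response)
    # Fail closed: a missing verdict, or any failed criterion, counts as FAIL.
    verdict = verdicts[-1].upper() if verdicts else "FAIL"
    if issues:
        verdict = "FAIL"
    return Critique(verdict=verdict, issues=issues)
```

Failing closed is a deliberate choice here: if the model forgets the verdict line or contradicts its own per-criterion results, the output gets another round rather than being passed through.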

Termination conditions

Don't always run a fixed number of iterations. Stop as soon as the critique passes, and cap the number of rounds:

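MAX_ITERATIONS = 3

def refine_until_pass(output, rubric):
    # evaluate() and refine() stand for the critique and refinement LLM calls
    for _ in range(MAX_ITERATIONS):
        critique = evaluate(output, rubric)
        if critique.verdict == "PASS":
            break  # termination condition met
        output = refine(output, critique.issues)
    return output  # best attempt if MAX_ITERATIONS reached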

The max iteration fallback is not optional. Reflection loops can get stuck in cycles where refinement creates new problems while fixing old ones.
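One way to notice such a cycle before the iteration cap is hit is to remember earlier drafts and stop when the refiner reproduces one. The sketch below assumes exact string matching is good enough to spot a repeat, which will miss near-duplicates. The function name, the Critique tuple, and the reason strings are all hypothetical.

```python
from typing import Callable, NamedTuple


class Critique(NamedTuple):
    verdict: str  # "PASS" or "FAIL"
    issues: list[str]


def refine_until_done(
    output: str,
    evaluate: Callable[[str], Critique],
    refine: Callable[[str, list[str]], str],
    max_iterations: int = 3,
) -> tuple[str, str]:
    """Return (output, reason); reason is 'pass', 'cycle', or 'max_iterations'."""
    seen = {output}
    for _ in range(max_iterations):
        critique = evaluate(output)
        if critique.verdict == "PASS":
            return output, "pass"
        output = refine(output, critique.issues)
        if output in seen:
            # The refiner reproduced an earlier draft: more rounds would loop.
            return output, "cycle"
        seen.add(output)
    return output, "max_iterations"
```

Returning the stop reason alongside the output lets the caller decide what to do with a draft that never passed, for example logging it or routing it to a human.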

Self-reflection vs. external-verifier reflection

Self-reflection: The same model generates both the output and the critique. Cheaper and simpler to implement. Works well when the model has the knowledge to recognize its own errors.

External-verifier reflection: A different model or deterministic system provides the critique. Preferred for high-stakes outputs. A code executor, a test suite, or a specialized classifier gives ground-truth feedback the generating model can't fake.

The strongest setup chains both: deterministic external verification first (does the code compile? do the tests pass?), then LLM self-reflection for quality dimensions the deterministic check can't catch.
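A minimal sketch of that chaining for the SQL case, using Python's built-in sqlite3 module as the deterministic verifier. It assumes the target schema can be recreated in an in-memory SQLite database and that SQLite's dialect is close enough to the production engine; llm_critique is a hypothetical callable standing in for the rubric-based critique call.

```python
import sqlite3
from typing import Callable


def check_sql_compiles(query: str, schema_ddl: str) -> list[str]:
    """Deterministic check: ask SQLite to compile the query against an empty schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        try:
            # EXPLAIN compiles the statement and returns its plan instead of running it.
            conn.execute(f"EXPLAIN {query}")
        except sqlite3.Error as exc:
            return [f"Query does not compile: {exc}"]
        return []
    finally:
        conn.close()


def chained_critique(
    query: str,
    schema_ddl: str,
    llm_critique: Callable[[str], list[str]],
) -> list[str]:
    """Run the cheap ground-truth check first; call the LLM critic only if it passes."""
    issues = check_sql_compiles(query, schema_ddl)
    if issues:
        return issues
    return llm_critique(query)
```

The compile check catches misspelled columns and syntax errors for free, but it will not flag a query that is valid yet answers the wrong question, such as a missing GROUP BY that SQLite tolerates. That is the gap the LLM critique is left to cover.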

When to use it

Use reflection loops when:

The task has clear, evaluatable success criteria (code correctness, schema compliance, factual accuracy).
First-draft quality is critical and latency allows for 2-3 extra LLM calls.
You're generating structured outputs (JSON, SQL, code) where the critique can be partially automated with deterministic checks.
The cost of reviewing bad output downstream exceeds the cost of additional inference.

Don't use reflection loops when:

Latency is a hard constraint. Each reflection round adds a critique call and a refine call, which typically costs 1-3 seconds.
The task is creative or subjective and there are no clear criteria to evaluate against.
The model consistently produces correct first drafts (validate this with evals before adding complexity).
The failure mode is hallucination, not reasoning. Reflection loops don't reliably fix factual hallucination: a critic that shares the generator's knowledge gaps may judge a hallucination as correct.
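The latency criterion can be checked with simple arithmetic before building anything. The sketch below assumes every round costs one critique call plus one refine call on top of the initial generation; the function names are hypothetical, and the per-call latency should come from your own measurements.

```python
def worst_case_calls(max_iterations: int) -> int:
    """One generate call, then a critique call and a refine call per round."""
    return 1 + 2 * max_iterations


def fits_latency_budget(
    max_iterations: int, seconds_per_call: float, budget_seconds: float
) -> bool:
    """True if the worst case stays within the latency budget."""
    return worst_case_calls(max_iterations) * seconds_per_call <= budget_seconds
```

With a cap of three rounds the worst case is seven sequential LLM calls, so a budget that comfortably fits one call may not fit the loop.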

Implementation sketch

This is a simplified implementation showing the core mechanism. Production systems add logging, cost tracking, and more sophisticated termination logic.

def reflection_loop(task: str, context: str, rubric: list[str]) -> str:
    MAX_ITERATIONS = 3

    # Phase 1: Generate initial output
    output = generate(task, context)

    for i in range(MAX_ITERATIONS):
        # Phase 2: Critique against explicit rubric
        critique = evaluate(
            output=output,
            task=task,
            rubric=rubric  # e.g. ["correct columns", "valid JOINs", "GROUP BY"]
        )

        if critique.verdict == "PASS":
            return output  # All criteria satisfied

        # Phase 3: Refine based on specific feedback
        output = refine(
            output=output,
            task=task,
            feedback=critique.issues  # Actionable, per-criterion feedback
        )

    return output  # Best attempt after max iterations

The key design decision: the rubric drives the critique. Without explicit criteria, the model defaults to generic "this looks good" feedback. With a rubric, it checks each dimension independently and returns specific, actionable issues.
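For readers who want to run the loop end to end, here is one possible way to fill in the three helpers. Everything beyond the loop itself is an assumption made for illustration: the llm parameter is a hypothetical callable that takes a prompt and returns text, the prompt wording is invented, and the critique format ("FAIL: reason" lines plus a VERDICT line) is just one convention.

```python
from dataclasses import dataclass, field
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion text out


@dataclass
class Critique:
    verdict: str  # "PASS" or "FAIL"
    issues: list[str] = field(default_factory=list)


def generate(llm: LLM, task: str, context: str) -> str:
    return llm(f"Task: {task}\nContext: {context}\nAnswer:")


def evaluate(llm: LLM, output: str, task: str, rubric: list[str]) -> Critique:
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, 1))
    reply = llm(
        f"Task: {task}\nOutput:\n{output}\n"
        f"For each unmet criterion write 'FAIL: <reason>' on its own line:\n{numbered}\n"
        "End with VERDICT: PASS or VERDICT: FAIL."
    )
    issues = [line[5:].strip() for line in reply.splitlines() if line.startswith("FAIL:")]
    # Pass only when the critic says so and lists no failed criteria.
    verdict = "PASS" if "VERDICT: PASS" in reply and not issues else "FAIL"
    return Critique(verdict, issues)


def refine(llm: LLM, output: str, task: str, feedback: list[str]) -> str:
    bullets = "\n".join(f"- {item}" for item in feedback)
    return llm(
        f"Task: {task}\nPrevious output:\n{output}\n"
        f"Fix these issues:\n{bullets}\nRevised output:"
    )


def reflection_loop(
    llm: LLM, task: str, context: str, rubric: list[str], max_iterations: int = 3
) -> str:
    output = generate(llm, task, context)
    for _ in range(max_iterations):
        critique = evaluate(llm, output, task, rubric)
        if critique.verdict == "PASS":
            return output
        output = refine(llm, output, task, critique.issues)
    return output  # best attempt after max iterations
```

Passing the model in as a callable keeps the loop independent of any particular provider and makes it easy to test with a scripted fake before spending real inference budget.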
