Self-critique evaluator loop
Add a dedicated evaluator LLM that grades agent outputs against a rubric before they leave the system, catching errors that the generating model misses.
TL;DR
- A self-critique evaluator loop adds a separate evaluator (different model or different prompt) that scores the generator's output against an explicit rubric before the output reaches the user or downstream system.
- The evaluator uses a structured rubric with 3-5 scored dimensions (correctness, completeness, safety, style), not a vague "is this good?" prompt. Each dimension gets a 1-5 score with specific criteria per level.
- The generate-evaluate-revise cycle typically runs 2-3 rounds. The first round catches 60-70% of issues. Diminishing returns hit hard after round 3.
- Using a separate evaluator model (not the same model with a different prompt) creates genuine adversarial tension. The evaluator has no sunk-cost bias defending the generator's output.
- Constitutional AI applies the same idea at training scale: a written constitution plays the role of the rubric, defining what "good" means, and the model critiques and revises its outputs against it.
- Limitation: the evaluator can only catch issues it's been told to look for. Rubric blind spots become system blind spots. And same-model evaluation risks sycophancy bias, where the evaluator systematically agrees with the generator.
The Problem It Solves
Your customer support agent generates a response to an angry user who received the wrong product. The response is polite, grammatically correct, and mentions the return policy. It also contains a factual error: it quotes the wrong refund timeline (says 14 days when the actual policy is 30 days). The agent doesn't know the response is wrong because it generated the answer from imperfect context retrieval and has no mechanism to verify its own claims.
You add a reflection step: "Review your response and improve it." The agent re-reads its own work and says "Looks good." Of course it does. The same model that wrote the wrong timeline doesn't know the timeline is wrong. Self-reflection without external criteria is like proofreading your own essay for factual errors when you don't know the facts.
This is the core problem: generators can't reliably evaluate their own output because they have the same knowledge gaps and biases that caused the errors in the first place. Open-ended "improve this" prompts lack direction. The model doesn't know what to look for, so it polishes surface-level issues (grammar, phrasing) while missing substantive errors (incorrect facts, missing edge cases, security vulnerabilities).
I've watched teams deploy reflection loops and celebrate the "improvement" without realizing the model was just rephrasing the same wrong answer in more confident language. Structured evaluation with explicit criteria changes the game entirely.
What Is It?
The self-critique evaluator loop separates generation from evaluation by adding a dedicated evaluator (a different LLM, or the same LLM with a specialized evaluation prompt) that scores the generator's output against a structured rubric. If the score falls below a threshold, the evaluator provides targeted feedback, and the generator revises. The loop continues until the evaluator passes the output or a round limit is reached.
Think of it as the relationship between a chef and a food critic. The chef creates the dish (generation). The food critic evaluates it against explicit criteria: presentation, flavor balance, temperature, portion size (the rubric). The critic doesn't cook, but they know exactly what "good" looks like. If the dish fails on flavor balance, the critic says "too salty, reduce by 30%," and the chef adjusts. A different person evaluating with clear criteria catches things the creator misses.
The key distinction from a generic reflection loop: the evaluator uses a structured rubric with explicit scoring criteria, not an open-ended "improve this" instruction. The rubric makes evaluation more consistent, debuggable, and auditable (LLM scoring is never fully deterministic, but fixed criteria narrow the variance). You can inspect exactly which dimension failed and why.
How It Works
Rubric design: the foundation of evaluation quality
The rubric is the most important component. A bad rubric produces a bad evaluator regardless of model quality. Design the rubric with 3-5 dimensions, each with a 1-5 scoring scale and specific criteria per level.
Here's a rubric for a customer support agent:
| Dimension | Score 1 (Fail) | Score 3 (Acceptable) | Score 5 (Excellent) | Weight |
|---|---|---|---|---|
| Correctness | Contains factual errors | Mostly accurate, minor imprecision | All claims verified, precise | 0.35 |
| Completeness | Misses the core question | Answers the question, omits edge cases | Covers question + anticipates follow-ups | 0.25 |
| Safety | Contains harmful/biased content | Neutral, no harm | Proactively inclusive, flags risks | 0.20 |
| Tone | Dismissive or robotic | Professional and clear | Empathetic, matches user emotion | 0.10 |
| Actionability | No next steps provided | Generic next steps | Specific, personalized action items | 0.10 |
The pass threshold is a policy decision. "All dimensions ≥ 3" is conservative. "Weighted average ≥ 3.5 with no dimension below 2" balances quality with throughput. I recommend starting strict and relaxing as you understand the evaluator's accuracy.
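To make the two threshold policies concrete, here is a minimal sketch. The weights mirror the rubric table; the function names and the example scores are hypothetical, chosen only for illustration.

```python
# Weights copied from the rubric table; they are assumed to sum to 1.
WEIGHTS = {
    "correctness": 0.35,
    "completeness": 0.25,
    "safety": 0.20,
    "tone": 0.10,
    "actionability": 0.10,
}


def weighted_average(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of the per-dimension scores."""
    return sum(weights[dim] * scores[dim] for dim in weights)


def passes_strict(scores: dict, floor: int = 3) -> bool:
    """Conservative policy: every dimension must score at least `floor`."""
    return all(score >= floor for score in scores.values())


def passes_weighted(
    scores: dict,
    min_average: float = 3.5,
    min_dimension: int = 2,
    weights: dict = WEIGHTS,
) -> bool:
    """Balanced policy: weighted average >= min_average, no dimension below min_dimension."""
    if any(scores[dim] < min_dimension for dim in weights):
        return False
    return weighted_average(scores, weights) >= min_average


# Hypothetical scores: solid overall, weak on tone.
example = {"correctness": 4, "completeness": 4, "safety": 4, "tone": 2, "actionability": 3}
# Hypothetical scores: polished everywhere except a factual error.
lopsided = {"correctness": 1, "completeness": 5, "safety": 5, "tone": 5, "actionability": 5}

print(passes_strict(example))     # False: tone is below 3
print(passes_weighted(example))   # True: average is about 3.7, nothing below 2
print(passes_weighted(lopsided))  # False: correctness is below the per-dimension floor
```

The `lopsided` case is why the weighted policy needs a per-dimension floor: its weighted average is about 3.6, so without the floor a response with a factual error would pass on the strength of its tone and completeness.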
One mistake I see repeatedly: making the rubric too granular. Ten dimensions with 10-point scales means the evaluator spends more tokens scoring than the generator spent creating. Keep it to 3-5 dimensions with 1-5 scales. The constraint forces you to identify what actually matters.
The generate-evaluate-revise cycle
The core loop has three phases that repeat until termination.
Phase 1: Generate. The generator LLM produces output from the task input plus any revision feedback from prior rounds. On the first round, the generator sees only the original task. On subsequent rounds, it sees the task plus the evaluator's critique from the previous round.
Phase 2: Evaluate. The evaluator LLM receives the generated output and the rubric. It scores each dimension independently and provides a text explanation for any dimension below the threshold. The evaluator's output is structured (JSON or a fixed format), not freeform text.
Phase 3: Decide. If all dimensions pass, the output is approved. If any dimension fails and the round limit hasn't been reached, the evaluator's feedback feeds back into the generator for revision. If the round limit is reached, the best-scoring version so far is either returned with a quality flag or escalated to a human.
def evaluate_output(evaluator, output, rubric, context):
    """Score output against rubric using evaluator LLM."""
    # format_rubric and evaluator.generate are assumed helpers, not defined here.
    # Doubled braces ({{ }}) render as literal braces inside the f-string.
    eval_prompt = f"""Score this output against each rubric dimension.

RUBRIC:
{format_rubric(rubric)}

ORIGINAL TASK:
{context.task}

OUTPUT TO EVALUATE:
{output}

Respond in this exact JSON format:
{{
  "scores": {{
    "correctness": {{"score": 1-5, "explanation": "..."}},
    "completeness": {{"score": 1-5, "explanation": "..."}},
    "safety": {{"score": 1-5, "explanation": "..."}},
    "tone": {{"score": 1-5, "explanation": "..."}},
    "actionability": {{"score": 1-5, "explanation": "..."}}
  }},
  "overall_pass": true/false,
  "revision_guidance": "Specific changes needed (empty if passing)"
}}"""
    return evaluator.generate(eval_prompt, response_format="json")
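The three phases can be tied together in a short driver loop. This is a sketch under stated assumptions, not a reference implementation: `run_loop`, `fake_generate`, and `fake_evaluate` are hypothetical names, the generator and evaluator are passed in as plain callables, and the evaluator is assumed to return an already-parsed dict in the JSON shape shown above. The pass rule here is "every dimension ≥ 3", recomputed from the scores rather than trusting the evaluator's own `overall_pass` flag.

```python
from typing import Callable, Optional

PASS_FLOOR = 3  # assumed policy: every dimension must score >= 3
MAX_ROUNDS = 3  # assumed round limit


def run_loop(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], dict],
    max_rounds: int = MAX_ROUNDS,
) -> dict:
    """Generate, evaluate, and revise until the output passes or rounds run out."""
    feedback = None
    best = None  # (total_score, output) for the best round so far

    for round_number in range(1, max_rounds + 1):
        # Phase 1: generate from the task plus any critique from the prior round.
        output = generate(task, feedback)

        # Phase 2: the evaluator returns per-dimension scores and guidance.
        evaluation = evaluate(task, output)
        scores = {dim: entry["score"] for dim, entry in evaluation["scores"].items()}

        total = sum(scores.values())
        if best is None or total > best[0]:
            best = (total, output)

        # Phase 3: approve, or feed the critique into the next round.
        if all(score >= PASS_FLOOR for score in scores.values()):
            return {"output": output, "approved": True, "rounds": round_number}
        feedback = evaluation["revision_guidance"]

    # Round limit reached: return the best-scoring version with a quality flag.
    return {"output": best[1], "approved": False, "rounds": max_rounds}


def fake_generate(task, feedback):
    # Stand-in generator: fixes the refund timeline only after being told to.
    if feedback:
        return "Refunds are accepted within 30 days."
    return "Refunds are accepted within 14 days."


def fake_evaluate(task, output):
    # Stand-in evaluator: fails correctness when the wrong timeline appears.
    correct = "30 days" in output
    return {
        "scores": {"correctness": {"score": 5 if correct else 1, "explanation": "..."}},
        "revision_guidance": "" if correct else "Policy says 30 days, not 14.",
    }


result = run_loop("Customer asks about the refund window", fake_generate, fake_evaluate)
print(result)
```

With these stand-ins the first draft fails on correctness, the critique is passed back, and the second draft is approved. In a real system the unapproved branch is where you would escalate to a human instead of returning the flagged output.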
Evaluator model selection strategies
The choice of evaluator model is a critical design decision with three viable strategies, each with distinct tradeoffs.
Same model, different prompt (cheapest). Use the same LLM for generation and evaluation but with a completely different system prompt. The evaluator prompt focuses purely on scoring against the rubric with no access to the generation reasoning. Cost is low (one additional LLM call per round), but sycophancy risk is highest because the model shares the same knowledge gaps and biases.
Different model (balanced). Use a separate model as evaluator. For example, GPT-4o generates and Claude evaluates, or vice versa. The different training data and alignment create genuine diversity of judgment. This is the sweet spot for most production systems: moderate cost with meaningful error diversity.
Specialized fine-tuned evaluator (most accurate, most expensive). Train a dedicated evaluation model on human-graded examples. CriticGPT (OpenAI's specialized critic model) is an example. The fine-tuned evaluator achieves the highest accuracy on the specific rubric but requires training data and maintenance.
For your interview: default to cross-model evaluation when discussing this pattern. It strikes the best balance and shows you understand why model diversity matters. Mention fine-tuned evaluators as the "if you need production-grade accuracy" option.
Sycophancy and how to mitigate it
Sycophancy is the evaluator's tendency to agree with the generator, especially when they share the same base model. The evaluator sees plausible-looking output and rates it highly because the same training that made the generator produce that output also makes the evaluator think it's correct.
Three mitigations work in practice. First, cross-model evaluation (discussed above) is the strongest defense. Different training data means different blind spots. Second, adversarial prompting: explicitly instruct the evaluator to "find at least one issue" or "assume the output contains errors and identify them." This counteracts the default agreeable behavior. Third, calibration anchoring: include known-good and known-bad examples in the evaluator prompt so it has a reference scale. Without anchors, the evaluator's scores drift toward the middle.
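The second and third mitigations are both prompt-construction choices, so they can be sketched in a few lines. Everything named here (`Anchor`, `ADVERSARIAL_INSTRUCTION`, `build_evaluator_prompt`, and the sample anchors) is hypothetical; the wording of the adversarial instruction is taken from the phrases quoted above, and the section layout is one reasonable arrangement, not a required format.

```python
from dataclasses import dataclass


@dataclass
class Anchor:
    """A human-graded reference example used to calibrate the evaluator."""
    output: str
    scores: dict
    note: str


# Adversarial framing: counteracts the evaluator's default agreeableness.
ADVERSARIAL_INSTRUCTION = (
    "Assume the output contains errors and identify them. "
    "Find at least one issue before assigning any score."
)


def build_evaluator_prompt(rubric_text: str, task: str, output: str, anchors: list) -> str:
    """Assemble an evaluator prompt with adversarial framing and calibration anchors."""
    anchor_blocks = []
    for i, anchor in enumerate(anchors, start=1):
        score_line = ", ".join(f"{dim}={score}" for dim, score in anchor.scores.items())
        anchor_blocks.append(
            f"EXAMPLE {i}:\n{anchor.output}\nSCORES: {score_line}\nWHY: {anchor.note}"
        )

    sections = [
        ADVERSARIAL_INSTRUCTION,
        f"RUBRIC:\n{rubric_text}",
        # Known-bad and known-good examples give the evaluator a reference scale.
        "CALIBRATION EXAMPLES:\n" + "\n\n".join(anchor_blocks),
        f"ORIGINAL TASK:\n{task}",
        f"OUTPUT TO EVALUATE:\n{output}",
    ]
    return "\n\n".join(sections)


anchors = [
    Anchor(
        output="You can return the item within 14 days.",
        scores={"correctness": 1, "tone": 3},
        note="Quotes the wrong refund window; the policy is 30 days.",
    ),
    Anchor(
        output="I'm sorry about the mix-up. You have 30 days to return it, and here is a prepaid label.",
        scores={"correctness": 5, "tone": 5},
        note="Accurate timeline, empathetic, gives a concrete next step.",
    ),
]

prompt = build_evaluator_prompt(
    rubric_text="correctness (1-5), tone (1-5)",
    task="Customer received the wrong product and asks about returns.",
    output="Returns are accepted within 14 days of delivery.",
    anchors=anchors,
)
print(prompt)
```

Placing the anchors before the output under review is a deliberate choice in this sketch: the evaluator reads the reference scale first, then applies it. Anchors should come from human-graded examples, since anchors graded by the generator's own model would reintroduce the shared blind spots they are meant to correct.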