LLM evals
Learn how to measure LLM application quality with assertion-based tests and LLM-as-judge, why evals come before architecture, and how to build an evaluation pipeline that gates production deploys.
TL;DR
- Evals are the systematic test suite for LLM applications: a dataset of inputs, expected behaviors, and judgment functions that run consistently across versions
- LLM-as-judge (use a stronger model to score a weaker model's outputs) is the most flexible evaluation type, but watch for position bias and verbosity bias
- RAGAS is a widely used framework for evaluating RAG pipelines across four dimensions: faithfulness, answer relevancy, context precision, context recall
- Evaluation-Driven Development means writing evals before writing prompts, not after deployment when users complain
- A 5% drop in your primary evaluation metric should block a production deploy, enforced as a CI/CD gate
The problem it solves
You can't unit test an LLM. A function that takes a string and returns a string has no single correct output. "Find all contracts mentioning indemnification" might produce five valid response formats, varying levels of completeness, and subtle errors that look correct on a quick read. An HTTP 200 tells you the API call succeeded, not whether the answer was good.
The consequence is that teams deploy their first working prompt to production and discover quality problems when users start complaining. By then the problem is hard to reproduce, harder to measure, and the fix is guesswork. I've seen this happen on almost every AI team that skips evaluations during the initial build.
Evals break this cycle. They give you a systematic way to answer: "Is this version better than the last one?" before pushing to production. Evals catch regressions the way unit tests catch bugs, but for non-deterministic outputs.
What is it?
An evaluation is a structured test for an LLM system. It has three components: a dataset of inputs (prompts, documents, queries), one or more judgment functions that score the quality of outputs, and a process for running them consistently so results are comparable across versions.
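As a rough illustration of those three components, here is a minimal sketch. The names `EvalCase`, `Judge`, and `run_suite` are hypothetical and do not come from any particular eval library, and `system` stands in for whatever function calls your model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical types for illustration only.
Judge = Callable[[str, dict], float]  # (output, expected) -> score in [0, 1]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: dict


def run_suite(
    system: Callable[[str], str],
    cases: list[EvalCase],
    judges: dict[str, Judge],
) -> dict[str, float]:
    """Run every case through the system and average each judge's scores."""
    scores: dict[str, list[float]] = {name: [] for name in judges}
    for case in cases:
        output = system(case.prompt)  # one model call per case
        for name, judge in judges.items():
            scores[name].append(judge(output, case.expected))
    return {name: mean(values) for name, values in scores.items()}
```

Because the dataset and judges are fixed inputs, running `run_suite` against two versions of `system` yields reports that can be compared metric by metric.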
Think of it like a driving test. The test has a fixed route (evaluation dataset), specific skills the examiner checks (judgment functions: parallel parking, lane changes, mirror checks), and a scoring rubric. You don't need a perfect score to pass, but you need to clear the threshold. And critically, you take the same test every time so you can compare results between attempts.
The goal is not to achieve 100% on every test case. It's to have a baseline score you can measure against when you change a prompt, swap a model, update your retrieval pipeline, or add a guardrail. Evals catch regressions. They're quality gates, not scorecards.
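One way to turn a baseline into a quality gate is sketched below. It assumes each run produces a metric-to-score mapping and interprets the 5% figure from the TL;DR as a relative drop; `find_regressions` is a hypothetical helper, not part of any framework.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_relative_drop: float = 0.05,
) -> dict[str, float]:
    """Return metrics whose score fell by more than the allowed relative drop."""
    regressions = {}
    for metric, base_score in baseline.items():
        if base_score <= 0:
            continue  # nothing to regress from
        new_score = candidate.get(metric, 0.0)  # a missing metric counts as 0
        drop = (base_score - new_score) / base_score
        if drop > max_relative_drop:
            regressions[metric] = round(drop, 4)
    return regressions
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so the deploy step never runs.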
How it works
Types of evals
There are four evaluation strategies. Most production systems use a combination.
Exact match checks whether the output equals the expected string exactly. Fast and reliable, but only works for constrained outputs: classification labels, structured extraction, yes/no answers. It breaks for any task where multiple phrasings are valid.
Assertion-based evals test specific properties of the output as boolean functions. "Does the response contain a citation?" "Is the output valid JSON?" "Is the response under 200 words?" Each assertion is cheap (often regex or a small parser) and composable. A response can pass 8 out of 10 assertions, giving you a granular score.
LLM-as-judge uses a powerful model (like GPT-4o or Claude 3.5 Sonnet) to score outputs against a rubric. You pass the input, the output, and scoring criteria. The judge returns a score and optionally a rationale. This handles open-ended tasks where assertions alone can't capture quality, but it's slower, more expensive, and has systematic biases.
Pairwise comparison shows the judge two outputs (unlabeled) and asks which is better. More reliable than absolute scoring because biases partially cancel out when you randomize which response appears first. Use this when comparing two prompt versions or two models head-to-head.
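Of the four strategies, exact match is the simplest to sketch. The version below adds light normalization (lowercasing and whitespace collapsing), which is an assumption on top of the strict definition above; drop it if casing or spacing is significant for your labels.

```python
def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(text.lower().split())


def exact_match_accuracy(outputs: list[str], expected: list[str]) -> float:
    """Share of outputs that equal the expected label after normalization."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    if not outputs:
        return 0.0
    hits = sum(normalize(o) == normalize(e) for o, e in zip(outputs, expected))
    return hits / len(outputs)
```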
Building an evaluation dataset
Your evals are only as good as your dataset. A weak dataset gives you false confidence.
Start with golden examples: 20-50 representative inputs with expected outputs (or expected properties). These are your regression anchors. When you change anything, these must still pass.
Add production samples: real user queries from logs. These capture the long tail of phrasings, edge cases, and topics you'd never generate synthetically. I've found that 10 real production examples teach you more about failure modes than 100 synthetic ones.
Include adversarial cases: inputs designed to break the system. Prompt injections, ambiguous queries, out-of-scope requests, inputs in unexpected languages. These test your guardrails.
Finally, add edge cases specific to your domain: very long inputs, empty inputs, inputs with special characters, queries that require information from multiple documents.
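Tagging each case with its source makes it possible to report results per category, so a drop in adversarial cases is not hidden by a strong golden set. The sketch below assumes results arrive as `(category, passed)` pairs; the category names and the helper are illustrative.

```python
from collections import defaultdict

# Category names mirror the four dataset sources described above.
CATEGORIES = ("golden", "production", "adversarial", "edge")


def pass_rate_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate per category from (category, passed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        totals[category] += 1
        passes[category] += int(passed)
    # Categories with no cases are omitted rather than reported as 0.
    return {c: passes[c] / totals[c] for c in CATEGORIES if totals[c]}
```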
Start synthetic, replace with real
If you're building before production traffic exists, generate synthetic examples to get started. But plan to replace at least 50% of your evaluation dataset with real user queries within the first month of production. Real data is worth 10x synthetic data for catching actual failure modes.
Assertion-based scoring
Assertion-based scoring is the workhorse of evaluation pipelines. Each assertion is a function that takes the model output and returns true or false.
```python
import json
import re


# Simplified evaluation test case with assertions
def assess_contract_extraction(output: str, expected: dict) -> dict:
    results = {}

    # Parse once. Invalid or non-object JSON fails the checks that depend on it
    # instead of raising and aborting the whole eval run.
    try:
        parsed = json.loads(output)
        valid_json = True
    except json.JSONDecodeError:
        parsed, valid_json = None, False
    fields = parsed if isinstance(parsed, dict) else {}

    # Structural assertions
    results["valid_json"] = valid_json
    results["has_parties"] = "parties" in fields
    results["has_date"] = bool(re.search(r"\d{4}-\d{2}-\d{2}", output))

    # Content assertions
    results["correct_party_count"] = (
        len(fields.get("parties", [])) == expected["party_count"]
    )
    results["mentions_indemnification"] = (
        "indemnif" in output.lower()
    ) == expected["has_indemnification"]

    # Scoring: pass rate across assertions
    score = sum(results.values()) / len(results)
    return {"score": score, "details": results}
```
The power of assertion-based scoring is composability. You stack 5-15 assertions per test case, each testing a different quality dimension. The aggregate score tells you how the system is performing overall, and individual assertion failures pinpoint exactly what broke.
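To see which assertions fail most often across a whole run, the per-case details can be tallied. This sketch assumes each result has the `{"score": ..., "details": {...}}` shape returned above; `assertion_failure_counts` is a hypothetical helper.

```python
from collections import Counter


def assertion_failure_counts(case_results: list[dict]) -> list[tuple[str, int]]:
    """Count failures per assertion name, most frequent first.

    Each item is assumed to look like {"score": float, "details": {name: bool}}.
    """
    failures = Counter()
    for result in case_results:
        for name, passed in result["details"].items():
            if not passed:
                failures[name] += 1
    return failures.most_common()
```

A single assertion dominating this list usually points at one specific break, such as a prompt change that dropped a required field.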
LLM-as-judge
For open-ended tasks (summarization, creative writing, conversational quality), assertions alone can't capture what "good" means. This is where you use a stronger model as a judge.
The rubric is everything. A vague rubric ("rate the quality from 1 to 5") produces inconsistent scores. A specific rubric with examples at each score level can produce scores that agree with human judgments 80-90% of the time.
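Here is a sketch of what a specific rubric might look like in practice, assuming a summarization task. The rubric wording, the `SCORE:` reply format, and both helper names are illustrative choices; the actual call to the judge model is left out because it depends on your provider.

```python
import re
from typing import Optional

# Hypothetical rubric for a summarization task; adapt the anchors to your domain.
RUBRIC = """Score the summary from 1 to 5 for faithfulness to the source.
5: every claim is supported by the source; nothing important is missing.
3: main points are correct, but one claim is unsupported or one key point is missing.
1: contradicts the source or is mostly unsupported.
Reply with a one-sentence rationale, then a final line of the form 'SCORE: <n>'."""


def build_judge_prompt(source: str, summary: str) -> str:
    """Assemble the full prompt sent to the judge model."""
    return (
        f"{RUBRIC}\n\n<source>\n{source}\n</source>\n\n"
        f"<summary>\n{summary}\n</summary>"
    )


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score; return None if the reply is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None
```

Returning `None` for malformed replies keeps parsing failures visible instead of silently counting them as low scores.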
Three biases to know and design around:
- Position bias: the judge prefers whichever response appears first. Fix: randomize positions and run each comparison twice with positions swapped.
- Verbosity bias: longer responses score higher even when brevity is correct. Fix: explicitly instruct the judge to penalize unnecessary length.
- Self-enhancement bias: a model rates its own family's outputs higher. Fix: use a different model family as the judge (judge with Claude if your system uses GPT, or vice versa).
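The position-bias fix can be sketched as follows. The judge is modeled as a hypothetical callable that returns `"first"`, `"second"`, or `"tie"`; treating a disagreement between the two orderings as a tie is a design assumption, not something the fix above prescribes.

```python
from typing import Callable

# Hypothetical judge signature: (question, first, second) -> "first", "second", or "tie".
PairwiseJudge = Callable[[str, str, str], str]


def compare_with_swap(
    judge: PairwiseJudge, question: str, answer_a: str, answer_b: str
) -> str:
    """Judge twice with positions swapped; return "a", "b", or "tie"."""
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # B shown first

    # Translate position-based verdicts back to the underlying answers.
    vote_1 = {"first": "a", "second": "b"}.get(first_pass, "tie")
    vote_2 = {"first": "b", "second": "a"}.get(second_pass, "tie")

    # Only count a win when both orderings agree; otherwise call it a tie.
    return vote_1 if vote_1 == vote_2 else "tie"
```

A judge that always picks the first position ends up with a tie on every comparison, which makes pure position bias easy to spot in the aggregate results.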