Design an LLM eval pipeline
Walk through designing an automated evaluation pipeline that catches quality regressions before deployment, handles thousands of test cases daily, and integrates into CI/CD.
TL;DR
- Four eval tiers run at different costs and cadences: assertion-based (every commit), RAGAS automated (staging), LLM-as-judge (PR merge), and human eval (weekly).
- The golden dataset is version-controlled alongside the prompt. Changing the prompt without updating the dataset is a common cause of silent regressions.
- LLM-as-judge uses a strong judge model (GPT-4o) separate from the model under test. Never use the same model to judge its own output.
- The CI/CD gate blocks deployment if the primary metric (faithfulness) drops more than 5% from baseline. Secondary metrics trigger warnings, not blocks.
- Regression tracking in a time-series DB with a 7-day rolling average catches slow quality drift that single-run thresholds miss.
Requirements
Functional requirements
- The eval pipeline runs 5,000 test cases per pipeline execution, covering real user questions with expected behavior.
- Results integrate with CI/CD: a failing eval run blocks deployment to production.
- Quality metrics (faithfulness, context relevance, answer relevancy) are tracked over time with trend visualization.
- Human evaluators can label a sampled subset of responses through a review UI, and those labels feed back into the pipeline.
- The pipeline runs on every prompt change, not just model changes.
Non-functional requirements
- Full eval run (5,000 cases) must complete in under 30 minutes to fit in a CI pipeline.
- LLM-as-judge must complete within 45 minutes so it does not hold up PR merges.
- Cost per full eval run stays under $50, which works out to at most $0.01 per case across 5,000 cases at LLM-as-judge pricing.
- Eval results are immutable once written; each run produces a versioned snapshot for auditing.
- Alert on-call if the 7-day rolling faithfulness average drops below the alert threshold.
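The time and cost limits above translate into per-case budgets. The sketch below works through that arithmetic. The 5,000 cases, 30 minutes, and $50 come from the requirements; the 6-second average per case is an assumed figure for illustration only, and the function names are hypothetical.

```python
import math

# Budget figures taken from the requirements above.
CASES_PER_RUN = 5_000
MAX_RUNTIME_S = 30 * 60   # 30-minute CI budget
MAX_COST_USD = 50.0       # per full eval run

# Assumption (not from the article): average end-to-end seconds per case.
ASSUMED_SECONDS_PER_CASE = 6.0


def required_throughput(cases: int, budget_s: int) -> float:
    """Cases per second the pipeline must sustain to finish within budget."""
    return cases / budget_s


def required_concurrency(cases: int, budget_s: int, seconds_per_case: float) -> int:
    """Minimum number of parallel workers, ignoring retries and queueing overhead."""
    return math.ceil(cases * seconds_per_case / budget_s)


def max_cost_per_case(budget_usd: float, cases: int) -> float:
    """Upper bound on average spend per case."""
    return budget_usd / cases


if __name__ == "__main__":
    print(f"throughput: {required_throughput(CASES_PER_RUN, MAX_RUNTIME_S):.2f} cases/s")
    print(f"workers:    {required_concurrency(CASES_PER_RUN, MAX_RUNTIME_S, ASSUMED_SECONDS_PER_CASE)}")
    print(f"cost/case:  ${max_cost_per_case(MAX_COST_USD, CASES_PER_RUN):.4f}")
```

Under the assumed 6 seconds per case, the run needs at least 17 workers. Real latency and retry rates would change that number, so it is best treated as a floor.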
The core entities
EvalCase
case_id, question, expected_behavior (can be exact string, reference answer, or behavioral description), category (faq/adversarial/edge), created_by, version
EvalRun
run_id, trigger (commit_hash/pr_id/schedule), prompt_version, model_version, status, started_at, completed_at, cases_run, cases_passed
EvalResult (one per case per run)
result_id, run_id, case_id, response_text, assertion_pass, faithfulness, context_relevance, answer_relevancy, judge_score, judge_reasoning, latency_ms
HumanLabel
label_id, result_id, evaluator_id, helpfulness (0-3), accuracy (0-3), conciseness (0-3), override_verdict, created_at
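One way to picture these entities is as frozen dataclasses. This is a minimal sketch: the field names come from the lists above, but the types, the optional defaults, and the string timestamps are assumptions. Freezing the classes is one possible way to reflect the requirement that results are immutable once written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    expected_behavior: str  # exact string, reference answer, or behavioral description
    category: str           # "faq", "adversarial", or "edge"
    created_by: str
    version: str


@dataclass(frozen=True)
class EvalRun:
    run_id: str
    trigger: str            # commit hash, PR id, or "schedule"
    prompt_version: str
    model_version: str
    status: str
    started_at: str         # ISO 8601 string (assumed representation)
    completed_at: Optional[str]
    cases_run: int
    cases_passed: int


@dataclass(frozen=True)
class EvalResult:
    """One row per case per run."""
    result_id: str
    run_id: str
    case_id: str
    response_text: str
    assertion_pass: bool
    faithfulness: Optional[float] = None
    context_relevance: Optional[float] = None
    answer_relevancy: Optional[float] = None
    judge_score: Optional[float] = None
    judge_reasoning: Optional[str] = None
    latency_ms: Optional[int] = None


@dataclass(frozen=True)
class HumanLabel:
    label_id: str
    result_id: str
    evaluator_id: str
    helpfulness: int
    accuracy: int
    conciseness: int
    override_verdict: Optional[str] = None
    created_at: Optional[str] = None

    def __post_init__(self) -> None:
        # Each rubric score uses the 0-3 scale from the entity definition.
        for name in ("helpfulness", "accuracy", "conciseness"):
            value = getattr(self, name)
            if not 0 <= value <= 3:
                raise ValueError(f"{name} must be in 0-3, got {value}")
```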
API design
POST /api/evals/trigger: kick off an eval run from CI
Request: { "commit_hash": "a3f9b12", "prompt_version": "v2.4.1", "run_type": "pr_merge" }
Response: { "run_id": "run_789", "status": "queued", "estimated_duration_s": 900 }
GET /api/evals/runs/{run_id}
Response: {
"run_id": "run_789", "status": "complete",
"summary": { "faithfulness": 0.87, "context_relevance": 0.82, "judge_avg_score": 2.4 },
"gate_decision": "PASS",
"baseline_delta": { "faithfulness": 0.02 }
}
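The gate_decision and baseline_delta fields could be derived along these lines. This is a sketch under stated assumptions, not the pipeline's actual logic: the article does not say whether the 5% drop is relative or absolute, so the code assumes a relative drop against the baseline, and it reuses the same 5% threshold for secondary-metric warnings. The function name and the warnings field are hypothetical.

```python
def gate_decision(summary, baseline, primary="faithfulness", max_relative_drop=0.05):
    """Compare a run summary to the stored baseline.

    Assumption: "drops more than 5%" is read as a relative drop from baseline.
    Only the primary metric can fail the gate; other metrics produce warnings.
    """
    if primary not in summary or primary not in baseline:
        raise ValueError(f"primary metric {primary!r} missing from summary or baseline")

    deltas = {}
    warnings = []
    decision = "PASS"
    for metric, base in baseline.items():
        if metric not in summary:
            continue
        delta = summary[metric] - base
        deltas[metric] = round(delta, 4)
        relative_drop = -delta / base if base > 0 else 0.0
        if relative_drop > max_relative_drop:
            if metric == primary:
                decision = "FAIL"          # blocks deployment
            else:
                warnings.append(metric)    # surfaced, but does not block
    return {"gate_decision": decision, "baseline_delta": deltas, "warnings": warnings}


if __name__ == "__main__":
    result = gate_decision(
        summary={"faithfulness": 0.87, "context_relevance": 0.82},
        baseline={"faithfulness": 0.85, "context_relevance": 0.90},
    )
    print(result)
```

A CI step could then exit non-zero whenever gate_decision is "FAIL", which is what lets the deployment stage be blocked.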
POST /api/evals/human-labels: submit human review batch
Request: { "labels": [{ "result_id": "r_abc", "helpfulness": 3, "accuracy": 2, "conciseness": 3 }] }
Response: { "received": 1, "calibration_delta": 0.03 }
GET /api/evals/trends?metric=faithfulness&days=30
Response: { "metric": "faithfulness", "datapoints": [{"date": "2026-04-04", "value": 0.85},...] }
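The datapoints returned by the trends endpoint are the natural input for the 7-day rolling average used in drift alerting. A minimal sketch follows, assuming one datapoint per day and ISO-formatted dates. The alert threshold is a caller-supplied value because the article does not state one, and both function names are hypothetical.

```python
from collections import deque


def rolling_average(datapoints, window=7):
    """Trailing mean over the last `window` datapoints (assumes one per day)."""
    ordered = sorted(datapoints, key=lambda p: p["date"])  # ISO dates sort lexicographically
    recent = deque(maxlen=window)
    averaged = []
    for point in ordered:
        recent.append(point["value"])
        averaged.append({"date": point["date"], "value": sum(recent) / len(recent)})
    return averaged


def should_alert(datapoints, threshold, window=7):
    """True when a full window exists and its latest average is below threshold."""
    averaged = rolling_average(datapoints, window)
    return len(averaged) >= window and averaged[-1]["value"] < threshold


if __name__ == "__main__":
    # Faithfulness slips by 0.01 per day: each step is small, the trend is not.
    points = [{"date": f"2026-04-{4 + i:02d}", "value": 0.90 - 0.01 * i} for i in range(10)]
    print(rolling_average(points)[-1])
    print(should_alert(points, threshold=0.85))
```

In the example, no single day drops by anything close to 5%, yet the rolling average still ends up below the illustrative 0.85 threshold. That is the slow-drift case a per-run gate would miss.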
High-level design
Every prompt change triggers a CI eval run. The eval runner distributes 5,000 test cases as async jobs across a worker pool, which processes them in parallel and writes results to the scores database. A score aggregator collects individual results, computes summary metrics, and compares to the stored baseline. The gate decides: pass (deploy) or fail (block).
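The fan-out and aggregation steps can be sketched with asyncio. This is an illustration rather than the pipeline's real runner: call_model and score are hypothetical callables supplied by the caller, the concurrency limit of 32 is an assumed default, and results are returned in memory instead of being written to a scores database.

```python
import asyncio
from statistics import mean


async def run_case(case, call_model, score):
    """Run one test case: get a response, then score it."""
    response = await call_model(case["question"])
    return {"case_id": case["case_id"], "faithfulness": score(case, response)}


async def run_eval(cases, call_model, score, concurrency=32):
    """Process all cases in parallel, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(case):
        async with semaphore:
            return await run_case(case, call_model, score)

    # gather returns results in the same order as the input cases.
    return await asyncio.gather(*(bounded(case) for case in cases))


def aggregate(results):
    """Collapse per-case results into the summary the gate compares to baseline."""
    return {
        "cases_run": len(results),
        "faithfulness": mean(r["faithfulness"] for r in results),
    }


if __name__ == "__main__":
    async def fake_model(question):  # stand-in for a real model call
        await asyncio.sleep(0)
        return question.upper()

    def fake_score(case, response):  # stand-in for a real metric
        return 1.0 if case["expected"] in response else 0.0

    demo_cases = [{"case_id": "c1", "question": "hello", "expected": "HELLO"}]
    print(aggregate(asyncio.run(run_eval(demo_cases, fake_model, fake_score))))
```

The semaphore caps in-flight requests, which matters when the model provider enforces rate limits.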
The four eval tiers are stacked by cost and cadence. Assertion-based tests run on every commit in seconds. RAGAS automated metrics run on the staging environment before merge. LLM-as-judge runs on PR merge as the final automated gate. Human eval runs weekly on a sampled subset and recalibrates the judge model's scoring. Each tier catches different failure modes, so skipping any one creates blind spots.
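The cheapest tier, assertion-based tests, needs no model call to score a response, which is why it can run on every commit. A minimal sketch of what such checks might look like is below. The check names (must_contain, must_not_match, max_chars) are hypothetical and not a format the article defines.

```python
import re


def run_assertions(response: str, checks: dict) -> list:
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    lowered = response.lower()

    # Required phrases, matched case-insensitively.
    for phrase in checks.get("must_contain", []):
        if phrase.lower() not in lowered:
            failures.append(f"missing phrase: {phrase}")

    # Forbidden regex patterns, for example refusal boilerplate.
    for pattern in checks.get("must_not_match", []):
        if re.search(pattern, response, flags=re.IGNORECASE):
            failures.append(f"forbidden pattern: {pattern}")

    # Optional length cap.
    max_chars = checks.get("max_chars")
    if max_chars is not None and len(response) > max_chars:
        failures.append(f"too long: {len(response)} > {max_chars} chars")

    return failures


if __name__ == "__main__":
    checks = {"must_contain": ["settings"], "must_not_match": [r"as an ai"], "max_chars": 200}
    print(run_assertions("You can reset your password from the Settings page.", checks))
```

Checks like these catch format and safety failures quickly, but they cannot judge whether an answer is faithful to its context. That gap is what the later tiers cover.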
The most important thing to get right is the golden dataset. It must be version-controlled in the same repository as the prompt, reviewed when the prompt changes, and augmented regularly with real user questions from production. A dataset that was curated for an old product version will generate misleading scores for a new one.
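One possible way to enforce "reviewed when the prompt changes" is to have the dataset record a fingerprint of the prompt it was last reviewed against, and to fail CI when the two diverge. This is a hypothetical convention, not something the article prescribes. The manifest key reviewed_against_prompt_sha256 and both function names are assumptions.

```python
import hashlib


def prompt_fingerprint(prompt_text: str) -> str:
    """Stable SHA-256 fingerprint of the prompt contents."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def dataset_is_current(prompt_text: str, manifest: dict) -> bool:
    """True when the dataset manifest was reviewed against this exact prompt."""
    return manifest.get("reviewed_against_prompt_sha256") == prompt_fingerprint(prompt_text)


if __name__ == "__main__":
    prompt_v1 = "Answer using only the provided context."
    manifest = {"reviewed_against_prompt_sha256": prompt_fingerprint(prompt_v1)}

    print(dataset_is_current(prompt_v1, manifest))
    # Editing the prompt without re-reviewing the dataset flips the check.
    print(dataset_is_current(prompt_v1 + " Cite your sources.", manifest))
```

Because the prompt and the manifest live in the same repository, a stale fingerprint shows up in the same pull request that changed the prompt.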