Design an LLM eval pipeline

TL;DR

Four eval tiers run at different costs and cadences: assertion-based (every commit), RAGAS automated (staging), LLM-as-judge (PR merge), and human eval (weekly).
The golden dataset is version-controlled alongside the prompt. Changing the prompt without updating the dataset is a common cause of silent regressions.
LLM-as-judge uses a strong judge model (GPT-4o) separate from the model under test. Never use the same model to judge its own output.
The CI/CD gate blocks deployment if the primary metric (faithfulness) drops more than 5% from baseline. Secondary metrics trigger warnings, not blocks.
Regression tracking in a time-series DB with a 7-day rolling average catches slow quality drift that single-run thresholds miss.

Requirements

Functional requirements

The eval pipeline runs 5,000 test cases per pipeline execution, covering real user questions with expected behavior.
Results integrate with CI/CD: a failing eval run blocks deployment to production.
Quality metrics (faithfulness, context relevance, answer relevancy) are tracked over time with trend visualization.
Human evaluators can label a sampled subset of responses through a review UI, and those labels feed back into the pipeline.
The pipeline runs on every prompt change, not just model changes.

Non-functional requirements

Full eval run (5,000 cases) must complete in under 30 minutes to fit in a CI pipeline.
LLM-as-judge must complete within 45 minutes to not block PR merges.
Cost per full eval run under $50 (5K cases at LLM-as-judge pricing).
Eval results are immutable once written; each run produces a versioned snapshot for auditing.
Alert on-call if the 7-day rolling faithfulness average drops below the alert threshold.

The core entities

EvalCase

case_id, question, expected_behavior (can be exact string, reference answer, or behavioral description), category (faq/adversarial/edge), created_by, version

EvalRun

run_id, trigger (commit_hash/pr_id/schedule), prompt_version, model_version, status, started_at, completed_at, cases_run, cases_passed

EvalResult (one per case per run)

result_id, run_id, case_id, response_text, assertion_pass, faithfulness, context_relevance, answer_relevancy, judge_score, judge_reasoning, latency_ms

HumanLabel

label_id, result_id, evaluator_id, helpfulness (0-3), accuracy (0-3), conciseness (0-3), override_verdict, created_at

API design

POST /api/evals/trigger — kick off an eval run from CI

Request:  { "commit_hash": "a3f9b12", "prompt_version": "v2.4.1", "run_type": "pr_merge" }
Response: { "run_id": "run_789", "status": "queued", "estimated_duration_s": 900 }

GET /api/evals/runs/{run_id}

Response: {
  "run_id": "run_789", "status": "complete",
  "summary": { "faithfulness": 0.87, "context_relevance": 0.82, "judge_avg_score": 2.4 },
  "gate_decision": "PASS",
  "baseline_delta": { "faithfulness": +0.02 }
}

POST /api/evals/human-labels — submit human review batch

Request:  { "labels": [{ "result_id": "r_abc", "helpfulness": 3, "accuracy": 2, "conciseness": 3 }] }
Response: { "received": 1, "calibration_delta": 0.03 }

GET /api/evals/trends?metric=faithfulness&days=30

Response: { "metric": "faithfulness", "datapoints": [{"date": "2026-04-04", "value": 0.85},...] }

Every prompt change triggers a CI eval run. The eval runner distributes 5,000 test cases as async jobs across a worker pool, which processes them in parallel and writes results to the scores database. A score aggregator collects individual results, computes summary metrics, and compares to the stored baseline. The gate decides: pass (deploy) or fail (block).

The four eval tiers are stacked by cost and cadence. Assertion-based tests run on every commit in seconds. RAGAS automated metrics run on the staging environment before merge. LLM-as-judge runs on PR merge as the final automated gate. Human eval runs weekly on a sampled subset and recalibrates the judge model's scoring. Each tier catches different failure modes, so skipping any one creates blind spots.

The most important thing to get right is the golden dataset. It must be version-controlled in the same repository as the prompt, reviewed when the prompt changes, and augmented regularly with real user questions from production. A dataset that was curated for an old product version will generate misleading scores for a new one.

TL;DR

Four eval tiers run at different costs and cadences: assertion-based (every commit), RAGAS automated (staging), LLM-as-judge (PR merge), and human eval (weekly).
The golden dataset is version-controlled alongside the prompt. Changing the prompt without updating the dataset is a common cause of silent regressions.
LLM-as-judge uses a strong judge model (GPT-4o) separate from the model under test. Never use the same model to judge its own output.
The CI/CD gate blocks deployment if the primary metric (faithfulness) drops more than 5% from baseline. Secondary metrics trigger warnings, not blocks.
Regression tracking in a time-series DB with a 7-day rolling average catches slow quality drift that single-run thresholds miss.

Requirements

Functional requirements

The eval pipeline runs 5,000 test cases per pipeline execution, covering real user questions with expected behavior.
Results integrate with CI/CD: a failing eval run blocks deployment to production.
Quality metrics (faithfulness, context relevance, answer relevancy) are tracked over time with trend visualization.
Human evaluators can label a sampled subset of responses through a review UI, and those labels feed back into the pipeline.
The pipeline runs on every prompt change, not just model changes.

Non-functional requirements

Full eval run (5,000 cases) must complete in under 30 minutes to fit in a CI pipeline.
LLM-as-judge must complete within 45 minutes to not block PR merges.
Cost per full eval run under $50 (5K cases at LLM-as-judge pricing).
Eval results are immutable once written; each run produces a versioned snapshot for auditing.
Alert on-call if the 7-day rolling faithfulness average drops below the alert threshold.

The core entities

EvalCase

case_id, question, expected_behavior (can be exact string, reference answer, or behavioral description), category (faq/adversarial/edge), created_by, version

EvalRun

run_id, trigger (commit_hash/pr_id/schedule), prompt_version, model_version, status, started_at, completed_at, cases_run, cases_passed

EvalResult (one per case per run)

result_id, run_id, case_id, response_text, assertion_pass, faithfulness, context_relevance, answer_relevancy, judge_score, judge_reasoning, latency_ms

HumanLabel

label_id, result_id, evaluator_id, helpfulness (0-3), accuracy (0-3), conciseness (0-3), override_verdict, created_at

API design

POST /api/evals/trigger — kick off an eval run from CI

Request:  { "commit_hash": "a3f9b12", "prompt_version": "v2.4.1", "run_type": "pr_merge" }
Response: { "run_id": "run_789", "status": "queued", "estimated_duration_s": 900 }

GET /api/evals/runs/{run_id}

Response: {
  "run_id": "run_789", "status": "complete",
  "summary": { "faithfulness": 0.87, "context_relevance": 0.82, "judge_avg_score": 2.4 },
  "gate_decision": "PASS",
  "baseline_delta": { "faithfulness": +0.02 }
}

POST /api/evals/human-labels — submit human review batch

Request:  { "labels": [{ "result_id": "r_abc", "helpfulness": 3, "accuracy": 2, "conciseness": 3 }] }
Response: { "received": 1, "calibration_delta": 0.03 }

GET /api/evals/trends?metric=faithfulness&days=30

Response: { "metric": "faithfulness", "datapoints": [{"date": "2026-04-04", "value": 0.85},...] }

Design an LLM eval pipeline

TL;DR

Requirements

Functional requirements

Non-functional requirements

The core entities

API design

High-level design

Continue Reading with Premium

Comments

Design an LLM eval pipeline

TL;DR

Requirements

Functional requirements

Non-functional requirements

The core entities

API design

High-level design

Continue Reading with Premium

Comments