Spec-as-test feedback loop
Convert natural language specifications into executable test cases that verify agent outputs, creating an automated acceptance testing loop from requirements.
TL;DR
- The spec-as-test feedback loop uses one LLM to convert natural language specifications into executable test assertions, then uses those tests to evaluate another agent's output. The spec becomes the automated acceptance test suite.
- The pipeline: spec (English) goes to a test generator (LLM), which produces test cases (code). A separate agent produces output. The tests evaluate the output. Failures feed back to the agent as structured feedback for retry.
- This closes most of the manual translation gap between "what we asked for" and "how we verify it." When the spec changes, the tests are regenerated rather than rewritten by hand, which removes the human test-writing bottleneck.
- Generated tests expose ambiguity in the spec. If the test generator has to commit to a specific assertion but the spec is vague, the resulting test either reveals the ambiguity (by making a wrong assumption) or forces the spec author to be precise.
- Self-healing loop: when the agent's output fails generated tests, the failure messages (line numbers, assertion mismatches, expected vs. actual) become structured feedback. The agent retries with this concrete signal rather than guessing.
- Limitation: LLM-generated tests can have bugs themselves. You need test validation: run generated tests against known-good and known-bad outputs to verify the tests work before trusting them as oracles.
The Problem It Solves
Your product manager writes a spec: "The search endpoint should return results sorted by relevance, support pagination with cursor-based navigation, and filter by date range. Results should include title, snippet, and score. Rate limit at 100 requests per minute per user."
Your coding agent reads this spec and generates the implementation. It looks reasonable. The code compiles. The endpoint responds. But does it actually match the spec? Is the pagination cursor-based or offset-based? Does the rate limiter check per-user or per-IP? Is the date range inclusive or exclusive on the boundaries?
Someone has to write tests to verify this. And that someone is usually a human. The human reads the spec, interprets it (introducing their own assumptions), writes test cases (which may not cover edge cases), and runs them. This takes hours. If the spec changes, the tests need manual updating too. The gap between "spec written" and "spec verified" is where bugs hide and schedules slip.
I've seen this cycle repeat on every team I've worked with: spec goes through three revisions, tests lag behind by two revisions, and the implementation passes tests that don't match the current spec. Everyone thinks they're aligned. They're not.
The deeper problem is that specifications and tests are the same information expressed in two different languages. The spec says "sort by relevance." The test says assert results[0].score >= results[1].score. These are saying the same thing. The manual translation between them is where errors creep in.
What Is It?
The spec-as-test feedback loop uses an LLM to automatically convert natural language specifications into executable test cases, then runs those tests against agent output to verify correctness. When tests fail, the structured failure messages (assertion errors, expected vs. actual values, stack traces) feed back to the agent as concrete, actionable feedback for retry.
Think of it as a building inspector who reads the blueprint and generates a checklist. The architect draws the blueprint (the spec). The inspector reads it and creates specific, measurable checkpoints: "wall must be load-bearing, minimum 8 inches thick at this location" and "electrical outlets must be every 12 feet along this wall." The contractor (the agent) builds. The inspector walks through with the checklist. Failed items get a precise note ("outlet missing between windows 3 and 4") that the contractor can act on without re-reading the entire blueprint.
The spec is the single source of truth. Change the spec, regenerate the tests, re-evaluate. The implementation must match the spec, not the other way around. The agent cannot "adjust the test to pass." This constraint is what makes the pattern reliable.
How It Works
From English to executable assertions
The test generator LLM reads the spec and produces executable test cases. This is a structured extraction task: for each requirement sentence, generate one or more test functions that assert the requirement is met.
The quality of generated tests depends heavily on the specificity of the spec. A vague spec like "results should be relevant" produces a vague test (assert len(results) > 0). A precise spec like "results sorted by BM25 relevance score, descending" produces a precise test (assert all(results[i].score >= results[i+1].score for i in range(len(results)-1))).
# Example: spec sentence → generated test
# `client` is assumed to be an HTTP test client for the service under test
# (for example, a pytest fixture); the article does not define it.

# Spec: "Return results sorted by relevance score, descending"
def test_results_sorted_by_relevance():
    response = client.get("/search?q=python+tutorial")
    results = response.json()["results"]
    scores = [r["score"] for r in results]
    assert scores == sorted(scores, reverse=True), \
        f"Results not sorted by score: {scores}"

# Spec: "Support cursor-based pagination with 20 results per page"
def test_cursor_pagination():
    page1 = client.get("/search?q=python&limit=20").json()
    # .get() so a missing key reaches the assertion instead of raising KeyError
    cursor = page1.get("next_cursor")
    assert cursor is not None, "Missing pagination cursor"
    assert len(page1["results"]) == 20
    page2 = client.get(f"/search?q=python&limit=20&cursor={cursor}").json()
    assert page2["results"][0]["id"] != page1["results"][0]["id"], \
        "Page 2 returned same results as page 1"

# Spec: "Rate limit at 100 requests per minute per user"
def test_rate_limit():
    # Send 101 requests as the same user; the check applies to the last one.
    for i in range(101):
        response = client.get("/search?q=test",
                              headers={"X-User-Id": "user-1"})
    assert response.status_code == 429, \
        f"Expected 429 after 100 requests, got {response.status_code}"
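The extraction step that produces tests like these can be sketched as a small prompt builder: split the spec into requirement sentences, then ask the test generator for tests one requirement at a time. This is a minimal sketch under assumptions. The sentence splitter is deliberately naive, the prompt wording is hypothetical, and the actual LLM call is left out because the article does not name a model or API.

```python
import re


def split_requirements(spec: str) -> list[str]:
    """Naively split a spec into sentences on ., ! or ? followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", spec.strip())
    return [s for s in sentences if s]


def build_test_prompt(requirement: str, index: int) -> str:
    """Build a (hypothetical) prompt asking for pytest tests for one requirement."""
    return (
        "You are generating pytest tests from a specification.\n"
        f"Requirement {index}: {requirement}\n"
        "Write one or more test functions that assert this requirement is met. "
        "If the requirement is ambiguous, state your assumption in a comment."
    )


spec = (
    "The search endpoint should return results sorted by relevance. "
    "Results should include title, snippet, and score. "
    "Rate limit at 100 requests per minute per user."
)
prompts = [
    build_test_prompt(req, i)
    for i, req in enumerate(split_requirements(spec), start=1)
]
```

Generating per requirement keeps each test traceable to the sentence it came from, which helps when a failing test has to be explained back to the spec author.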
The test types generated depend on what the spec describes:
| Spec Content | Test Type Generated | Example |
|---|---|---|
| Output format requirements | Schema validation | assert "title" in result and "score" in result |
| Ordering/sorting rules | Sequential assertions | assert results == sorted(results, key=...) |
| Rate limits / thresholds | Boundary tests | assert response.status_code == 429 after N+1 calls |
| Data transformations | Input/output diffing | assert transform(input) == expected_output |
| Behavioral rules | State-based tests | assert after_action.state == expected_state |
| Error handling | Negative tests | assert error_response.status_code == 400 |
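As an illustration of the first row, a schema validation test for "Results should include title, snippet, and score" might look like the sketch below. The field types are an assumption (the spec names the fields but not their types), and `schema_errors` is a hypothetical helper, not part of any library.

```python
# Assumed types: the spec only lists the field names.
REQUIRED_FIELDS = {"title": str, "snippet": str, "score": (int, float)}


def schema_errors(result: dict) -> list[str]:
    """Return a list of schema problems for one search result (empty if valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in result:
            errors.append(f"missing field: {field}")
        elif not isinstance(result[field], expected_type):
            actual = type(result[field]).__name__
            errors.append(f"wrong type for {field}: {actual}")
    return errors


def test_result_schema():
    # Sample payload standing in for a real response body.
    results = [{"title": "Intro to Python", "snippet": "A short excerpt", "score": 0.92}]
    for result in results:
        assert schema_errors(result) == [], schema_errors(result)
```

Returning a list of problems instead of asserting on the first one gives the agent every missing field in a single retry.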
The feedback loop architecture
The feedback loop has four phases: generate tests from spec, validate the tests themselves, run tests against agent output, and feed failures back for retry. Each phase has its own failure modes and quality checks.
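The four phases can be sketched as a single driver function. Everything here is a simplified assumption: the callables stand in for the test generator, the validator, and the agent, and each "test" is modeled as a function that returns a failure message or None, which is not how pytest works but keeps the control flow visible.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A test takes the agent's output and returns a failure message, or None on pass.
Test = Callable[[str], Optional[str]]


@dataclass
class LoopResult:
    passed: bool
    attempts: int
    failures: list[str]


def run_feedback_loop(
    spec: str,
    generate_tests: Callable[[str], list[Test]],
    validate_tests: Callable[[list[Test]], bool],
    agent: Callable[[str, list[str]], str],
    max_attempts: int = 3,
) -> LoopResult:
    tests = generate_tests(spec)                      # phase 1: spec -> tests
    if not validate_tests(tests):                     # phase 2: test the tests
        raise ValueError("generated tests failed validation")
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        output = agent(spec, failures)                # agent sees prior failures
        failures = []
        for test in tests:                            # phase 3: run the tests
            message = test(output)
            if message is not None:
                failures.append(message)
        if not failures:
            return LoopResult(True, attempt, [])
        # phase 4: failures carry over into the next agent call
    return LoopResult(False, max_attempts, failures)


# Stub components, purely for demonstration.
def stub_generate(spec: str) -> list[Test]:
    def must_be_sorted(output: str) -> Optional[str]:
        return None if output == "sorted" else f"expected 'sorted', got {output!r}"
    return [must_be_sorted]


def stub_agent(spec: str, failures: list[str]) -> str:
    return "sorted" if failures else "unsorted"


result = run_feedback_loop("sort results", stub_generate, lambda t: len(t) > 0, stub_agent)
```

The `max_attempts` cap is a design choice worth keeping in any real version: without it, an agent that cannot satisfy a test loops forever.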
The test validation step is critical and often skipped. I've seen teams deploy generated tests that had bugs in the test setup (wrong base URL, missing auth headers). The agent "fixed" its code to pass a broken test, introducing a real bug. Always validate generated tests against a known-good reference implementation or manually reviewed test cases before trusting them.
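A minimal sketch of that validation step: run each generated test against one output known to be correct and one known to be wrong. A trustworthy test passes the first and fails the second. The fixtures and check functions below are hypothetical examples, and the function names are not from any framework.

```python
from typing import Callable


def validate_test(
    test: Callable[[list[dict]], None],
    known_good: list[dict],
    known_bad: list[dict],
) -> list[str]:
    """Return reasons the test cannot be trusted (empty list means it looks sound)."""
    problems = []
    try:
        test(known_good)
    except Exception as exc:  # assertion failure or a setup bug in the test
        problems.append(f"rejects known-good output: {exc}")
    try:
        test(known_bad)
    except AssertionError:
        pass  # expected: the test catches the defect
    except Exception as exc:
        problems.append(f"crashes on known-bad output: {exc}")
    else:
        problems.append("accepts known-bad output")
    return problems


def check_sorted(results: list[dict]) -> None:
    scores = [r["score"] for r in results]
    assert scores == sorted(scores, reverse=True), f"not sorted: {scores}"


def check_nonempty(results: list[dict]) -> None:
    # Too weak: it cannot tell sorted results from unsorted ones.
    assert len(results) > 0


good = [{"score": 0.9}, {"score": 0.5}]
bad = [{"score": 0.5}, {"score": 0.9}]
```

Here `validate_test(check_sorted, good, bad)` reports no problems, while `check_nonempty` is flagged because it accepts the unsorted output. A test that passes everything is as useless an oracle as one that crashes.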
Self-healing through structured failure feedback
When tests fail, the failure output is the highest-quality feedback signal available. Unlike vague self-critique ("this looks wrong"), test failures are precise: file, line number, assertion, expected value, actual value. This structured signal tells the agent exactly what to fix.
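A rough sketch of how assertion failures might be captured and turned into a feedback message for the agent. The `Failure` record and the message layout are assumptions for illustration; a real setup would more likely parse the test runner's own report.

```python
import traceback
from dataclasses import dataclass
from typing import Callable


@dataclass
class Failure:
    test_name: str
    file: str
    line: int
    message: str


def run_and_collect(tests: list[Callable], output) -> list[Failure]:
    """Run each test against the output and record where and why it failed."""
    failures = []
    for test in tests:
        try:
            test(output)
        except AssertionError as exc:
            # The last traceback frame is where the assertion was raised.
            frame = traceback.extract_tb(exc.__traceback__)[-1]
            failures.append(
                Failure(test.__name__, frame.filename, frame.lineno, str(exc))
            )
    return failures


def format_feedback(failures: list[Failure]) -> str:
    """Render failures as a compact message to include in the agent's retry prompt."""
    lines = [f"- {f.test_name} ({f.file}:{f.line}): {f.message}" for f in failures]
    return "The following tests failed:\n" + "\n".join(lines)


def check_sorted(results: list[dict]) -> None:
    scores = [r["score"] for r in results]
    expected = sorted(scores, reverse=True)
    assert scores == expected, f"expected {expected}, got {scores}"


failures = run_and_collect([check_sorted], [{"score": 0.5}, {"score": 0.9}])
feedback = format_feedback(failures)
```

The value is in the assertion message: "expected [0.9, 0.5], got [0.5, 0.9]" tells the agent which property was violated and on what data, so the retry starts from evidence instead of a guess.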