Coding agent CI feedback loop
Connect coding agents to CI pipelines so they receive build errors, test failures, and lint violations as structured feedback, enabling autonomous fix-and-retry cycles.
TL;DR
- A coding agent CI feedback loop connects an LLM coding agent to a CI pipeline (build, tests, linter, type checker) so the agent receives real pass/fail signals instead of guessing whether its code is correct.
- The core loop: agent writes code, CI runs, failures feed back as structured text, agent fixes the specific failures, CI runs again. Stop when green or after 3-5 iterations.
- CI output is deterministic ground truth: pytest either passes or it doesn't. This replaces speculative self-critique ("does this look right?") with verifiable feedback ("line 42: NameError: name 'user_id' is not defined").
- Structured feedback parsing is critical. Raw CI logs are noisy. Extract the failing test name, the error message, and the file/line location. Feed only the relevant failures to the agent, not 500 lines of build output.
- Set a hard iteration cap (3-5 cycles). Without a cap, the agent can enter infinite fix-break loops where fixing one test breaks another, burning tokens and compute indefinitely.
- Limitation: CI feedback tells the agent what failed, not why. The agent still needs reasoning capability to diagnose root causes. Fast CI (under 60 seconds) is essential. Slow pipelines break the tight feedback loop and inflate cost per iteration.
The Problem It Solves
Your coding agent generates a Python function that parses CSV files. It looks correct. The variable names make sense. The logic flows reasonably. The agent even explains its reasoning. But when you run the code, it crashes: the csv.reader call uses the wrong delimiter parameter name. The agent wrote separator=',' instead of delimiter=','.
No amount of LLM self-reflection would have caught this. The agent can re-read its own code a hundred times, and it will still think separator is a valid parameter. The knowledge isn't in the model's weights for this specific API. The only way to catch it is to actually run the code and observe the TypeError that rejects 'separator' as an invalid keyword argument.
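To see the failure for yourself, here is a minimal sketch that assumes a standard CPython install. The helper name `try_bad_keyword` is made up for illustration. The exact wording of the message differs between Python versions, so the snippet prints whatever comes back rather than promising specific text.

```python
import csv
import io


def try_bad_keyword() -> str:
    """Call csv.reader with the wrong keyword and report what happens."""
    sample = io.StringIO("name,age\nada,36\n")
    try:
        # 'separator' is not a csv.reader parameter; the real name is 'delimiter'
        rows = list(csv.reader(sample, separator=","))
    except TypeError as exc:
        print(f"TypeError: {exc}")
        return "TypeError"
    return f"parsed {len(rows)} rows"


result = try_bad_keyword()
```

Reading the source never surfaces this; executing it does, immediately.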
This is the fundamental problem: LLMs generate plausible code, not correct code. The difference between plausible and correct is only detectable at runtime. Without a runtime feedback signal, the agent is flying blind. Every iteration of "let me re-examine my code" is the agent staring at the same wrong answer and confirming it looks fine.
Human code review catches some of these issues, but it's slow (hours to days), expensive (senior engineer time), and scales linearly with agent output volume. CI catches them in seconds, scales to thousands of runs per day, and never gets tired or misses a known test case.
What Is It?
The coding agent CI feedback loop connects an LLM coding agent to a continuous integration pipeline so the agent can submit code, receive real execution results (build output, test results, lint errors, type check failures), and use those results to fix its own code autonomously. The agent treats CI as an oracle: an authoritative source of truth about code correctness.
Think of it as a musician practicing with a tuner instead of "by ear." Playing by ear, you might think a note sounds right because your internal reference is slightly off. A tuner gives you objective, measurable feedback: you're 15 cents sharp. You adjust. You check again. Eventually, you're in tune. The tuner doesn't tell you how to adjust your fingering, but it tells you exactly what's wrong, which is often enough.
The loop terminates in two ways: all CI checks pass (success), or the iteration count exceeds the cap (failure with partial results). Both outcomes are valid. A coding agent that fixes 3 of 4 test failures in 3 iterations and returns with a clear description of the remaining failure is still useful.
How It Works
The feedback loop architecture
The architecture has four components: the agent, a code submission layer, the CI pipeline, and a feedback parser. The agent writes code. The submission layer commits it to a branch and triggers CI. The CI pipeline runs build, test, and lint stages. The feedback parser extracts structured failure information from CI output and feeds it back to the agent.
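The data passed between these components can be sketched as a few small types. The attribute and method names (`passed`, `failures`, `stage`, `file_path`, `generate`, `fix`, `run`, and so on) are the ones the code in this article uses. The type annotations, the defaults, and the choice to fold the submission layer into `CIRunner.run` are my assumptions, not a fixed interface.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Failure:
    stage: str                                  # "test", "lint", "build", or "typecheck"
    file_path: str
    line_number: int
    error_type: str                             # e.g. "NameError"
    message: str
    test_name: Optional[str] = None             # assumed empty for non-test stages
    expected_vs_actual: Optional[dict] = None   # {"expected": ..., "actual": ...}


@dataclass
class CIResult:
    passed: bool
    failures: list[Failure] = field(default_factory=list)


class Agent(Protocol):
    async def generate(self, task: str) -> str: ...
    async def fix(self, code: str, feedback: str) -> str: ...


class CIRunner(Protocol):
    # Assumed to cover both submission (commit + trigger) and result collection
    async def run(self, code: str) -> CIResult: ...


# Illustrative values only
example = CIResult(
    passed=False,
    failures=[Failure("test", "parser.py", 42, "NameError",
                      "name 'user_id' is not defined",
                      test_name="test_parse_csv_with_headers")],
)
```

Using `Protocol` keeps the loop independent of any particular agent framework or CI vendor: anything with matching methods will do.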
```python
class CIFeedbackLoop:
    def __init__(self, agent, ci_runner, max_iterations=4):
        self.agent = agent
        self.ci = ci_runner
        self.max_iterations = max_iterations
        self.history = []  # One record per iteration, for debugging

    async def run(self, task: str) -> dict:
        code = await self.agent.generate(task)
        ci_result = None
        for i in range(self.max_iterations):
            ci_result = await self.ci.run(code)
            self.history.append({
                "iteration": i + 1,
                "code_hash": hash(code),  # Only comparable within one process
                "ci_passed": ci_result.passed,
                "failures": ci_result.failures,
            })
            if ci_result.passed:
                return {"status": "success", "code": code,
                        "iterations": i + 1}
            if i == self.max_iterations - 1:
                break  # Out of budget: return the code CI actually tested
            # Parse failures into structured feedback
            feedback = self.parse_failures(ci_result)
            code = await self.agent.fix(code, feedback)
        return {"status": "max_iterations", "code": code,
                "remaining_failures": ci_result.failures if ci_result else [],
                "iterations": self.max_iterations}
```
The history list is important for debugging. When something goes wrong, you can trace exactly what the agent produced at each step, what CI reported, and how the agent responded. I've found this history invaluable for diagnosing fix-break cycles where the agent oscillates between two broken states.
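One way to spot that kind of oscillation is to look for a `code_hash` that shows up twice in the history. This is a sketch under the assumption that history records look like the ones built in the loop; `find_oscillation` is a hypothetical helper, and the hash values below are invented.

```python
from typing import Optional


def find_oscillation(history: list[dict]) -> Optional[tuple[int, int]]:
    """Return (earlier, later) iteration numbers for the first repeated code_hash."""
    first_seen: dict[int, int] = {}
    for record in history:
        code_hash = record["code_hash"]
        if code_hash in first_seen:
            return first_seen[code_hash], record["iteration"]
        first_seen[code_hash] = record["iteration"]
    return None


# Invented history: iteration 3 reproduces the code from iteration 1
history = [
    {"iteration": 1, "code_hash": 111, "ci_passed": False, "failures": ["test_a"]},
    {"iteration": 2, "code_hash": 222, "ci_passed": False, "failures": ["test_b"]},
    {"iteration": 3, "code_hash": 111, "ci_passed": False, "failures": ["test_a"]},
]
print(find_oscillation(history))  # (1, 3)
```

A repeat means the agent has returned to code it already tried, so further iterations are unlikely to help without new information.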
Structured feedback parsing
Raw CI output is hostile to LLMs. A typical pytest failure dump contains 50+ lines of traceback, fixture setup, captured stdout, and comparison diffs. Feeding all of it to the agent wastes tokens, confuses the model, and buries the actual error signal in noise.
The feedback parser extracts only what the agent needs to act: which file, which line, what error, and which test.
```python
def parse_failures(self, ci_result) -> str:
    """Convert raw CI output to structured agent-friendly feedback."""
    structured = []
    for failure in ci_result.failures:
        entry = {
            "stage": failure.stage,            # "test", "lint", "build", "typecheck"
            "file": failure.file_path,
            "line": failure.line_number,
            "error_type": failure.error_type,  # "NameError", "AssertionError"
            "message": failure.message[:200],  # Truncate long messages
            "test_name": failure.test_name,    # "test_parse_csv_with_headers"
        }
        if failure.stage == "test" and failure.expected_vs_actual:
            entry["expected"] = failure.expected_vs_actual["expected"]
            entry["actual"] = failure.expected_vs_actual["actual"]
        structured.append(entry)

    # Format as concise text for the LLM
    lines = [f"CI found {len(structured)} failure(s):\n"]
    for i, f in enumerate(structured, 1):
        lines.append(f"{i}. [{f['stage']}] {f['file']}:{f['line']}")
        if f["test_name"]:
            # Name the failing test so the agent knows which check to satisfy
            lines.append(f"   Test: {f['test_name']}")
        lines.append(f"   {f['error_type']}: {f['message']}")
        if "expected" in f:
            lines.append(f"   Expected: {f['expected']}")
            lines.append(f"   Actual: {f['actual']}")
    return "\n".join(lines)
```
The truncation at 200 characters per message is deliberate. Some error messages (especially assertion diffs on large data structures) can be thousands of characters. The agent needs to know what failed and roughly why, not parse a 2000-character diff. For assertion failures, the expected vs. actual values provide the most actionable signal.
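Where do the structured failures come from in the first place? One cheap source is the one-line-per-failure summary that recent pytest versions print at the end of a run. The sketch below assumes lines shaped like `FAILED <path>::<test> - <detail>`. Treat that shape as an assumption and adjust the pattern to your own output. It also does not handle parametrized test IDs that contain spaces. `parse_summary` and both regexes are illustrative names.

```python
import re

# Assumed line shape: "FAILED <path>::<test> - <detail>"
SUMMARY_LINE = re.compile(
    r"^FAILED (?P<file>[^:\s]+)::(?P<test>\S+)(?: - (?P<detail>.*))?$"
)
# Splits "NameError: some message" into exception name and message
ERROR_PREFIX = re.compile(
    r"^(?P<error_type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<message>.*)$"
)


def parse_summary(log: str, max_message: int = 200) -> list[dict]:
    """Pull failing tests out of a log, ignoring every other line."""
    failures = []
    for line in log.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue  # banner, progress output, captured stdout, ...
        detail = match["detail"] or ""
        prefix = ERROR_PREFIX.match(detail)
        failures.append({
            "stage": "test",
            "file": match["file"],
            "test_name": match["test"],
            # Left empty when the detail has no "SomeError: " prefix
            "error_type": prefix["error_type"] if prefix else "",
            "message": (prefix["message"] if prefix else detail)[:max_message],
        })
    return failures


sample_log = """\
============ short test summary info ============
FAILED tests/test_csv.py::test_parse_csv_with_headers - NameError: name 'user_id' is not defined
FAILED tests/test_csv.py::test_empty_file - assert 0 == 1
1 passed, 2 failed in 0.12s
"""
parsed = parse_summary(sample_log)
```

The summary lines do not carry line numbers or expected/actual values, so a fuller parser would also need the traceback section or a machine-readable report.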
Feedback granularity: what to include
Not all CI output is equally useful. Here's the priority ranking:
| CI Stage | Feedback Value | What to Include | What to Exclude |
|---|---|---|---|
| Build/compile errors | Highest | File, line, error message | Full stack trace, build system logs |
| Test failures | High | Test name, assertion error, expected vs actual | Fixture setup, captured stdout, warnings |
| Type check errors | High | File, line, type mismatch description | Config file paths, import resolution logs |
| Lint violations | Medium | Rule name, file, line, auto-fix suggestion | Lint config details, summary statistics |
| Coverage drops | Low | Uncovered file/function names | Line-by-line coverage diffs |
Build errors come first because they block everything else. If the code doesn't compile, test failures are irrelevant noise. Feed build errors to the agent before test errors. I've seen agents waste iterations fixing test assertions when the real problem was a syntax error on line 3 that made the entire file unparseable.
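One possible policy that follows from this ranking is to show the agent only the failures from the highest-priority stage that has any. The numeric ranks below are my own encoding of the table, and `select_feedback` is a hypothetical helper. A softer variant would sort the failures instead of filtering them.

```python
# Lower rank = more urgent; values are an assumed encoding of the table above
STAGE_PRIORITY = {"build": 0, "test": 1, "typecheck": 1, "lint": 2, "coverage": 3}


def _rank(failure: dict) -> int:
    # Unknown stages sort after every known one
    return STAGE_PRIORITY.get(failure["stage"], len(STAGE_PRIORITY))


def select_feedback(failures: list[dict]) -> list[dict]:
    """Keep only the failures from the most urgent stage that reported any."""
    if not failures:
        return []
    best = min(_rank(f) for f in failures)
    return [f for f in failures if _rank(f) == best]


mixed = [
    {"stage": "lint", "file": "parser.py", "message": "unused import"},
    {"stage": "test", "file": "test_parser.py", "message": "assertion failed"},
    {"stage": "build", "file": "parser.py", "message": "invalid syntax"},
]
print([f["stage"] for f in select_feedback(mixed)])  # ['build']
```

With the build error filtered to the front, the agent cannot spend an iteration on an assertion that only fails because the file never parsed.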
The iteration dynamics
Each iteration should fix at least one failure. If the agent's fix introduces more failures than it resolves, that's a signal the agent is out of its depth on this particular problem. A good feedback loop implementation tracks the failure count per iteration and aborts early if the count increases for two consecutive iterations (the "regression detector").
```python
def should_abort_early(self) -> bool:
    """Abort if the failure count rose in each of the last two iterations."""
    if len(self.history) < 3:
        return False  # Need three data points to see two consecutive increases
    counts = [len(record["failures"]) for record in self.history[-3:]]
    return counts[0] < counts[1] < counts[2]
```
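Raw counts can hide churn: an iteration that fixes two failures and introduces two new ones looks flat. To compare what a fix resolved against what it introduced, you can diff failures by identity. This sketch assumes failures are dicts with `stage`, `file`, `test_name`, and `error_type` keys. The key deliberately leaves out the line number, since lines shift whenever the agent edits the file. `failure_key` and `diff_failures` are illustrative names.

```python
def failure_key(failure: dict) -> tuple:
    """Identity of a failure that survives line-number shifts."""
    return (
        failure["stage"],
        failure["file"],
        failure.get("test_name") or "",
        failure["error_type"],
    )


def diff_failures(previous: list[dict], current: list[dict]) -> dict:
    """Split the change between two iterations into resolved and introduced."""
    prev_keys = {failure_key(f) for f in previous}
    curr_keys = {failure_key(f) for f in current}
    return {
        "resolved": sorted(prev_keys - curr_keys),
        "introduced": sorted(curr_keys - prev_keys),
    }


# Invented failures for illustration
a = {"stage": "test", "file": "t.py", "test_name": "test_a", "error_type": "NameError"}
b = {"stage": "test", "file": "t.py", "test_name": "test_b", "error_type": "AssertionError"}
c = {"stage": "test", "file": "t.py", "test_name": "test_c", "error_type": "KeyError"}
d = {"stage": "lint", "file": "p.py", "test_name": None, "error_type": "E501"}

delta = diff_failures(previous=[a, b], current=[b, c, d])
net_worse = len(delta["introduced"]) > len(delta["resolved"])
```

Here the fix resolved `test_a` but introduced two new failures, so `net_worse` is true even though `test_b` never changed.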
Connecting to real CI systems
In practice, the CI connection takes one of three forms:
Git-based (production-like). The agent commits to a branch, pushes, and a GitHub Actions/GitLab CI pipeline runs. The feedback loop polls the CI API for results. This is the most realistic but slowest (2-5 minutes per cycle including queue time).
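The polling half of that setup can be sketched without committing to any particular CI vendor. Everything here is an assumption for illustration: `fetch_status` stands in for whatever call your CI API client provides, and the status strings ("queued", "running", "passed", "failed") are placeholders rather than any real API's vocabulary.

```python
import time
from typing import Callable

# Placeholder status names; map your CI system's states onto these
TERMINAL_STATUSES = {"passed", "failed"}


def wait_for_ci(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until CI reaches a terminal status or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            return "timeout"  # Counts against the iteration budget like a failure
        sleep(poll_interval_s)


# Simulated run: queued, then running, then failed (no real waiting)
statuses = iter(["queued", "running", "failed"])
outcome = wait_for_ci(lambda: next(statuses), sleep=lambda seconds: None)
```

Injecting `sleep` and `clock` keeps the function testable without real delays. The timeout matters because a stuck pipeline would otherwise stall the whole loop.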