Coding agent CI feedback loop

TL;DR

A coding agent CI feedback loop connects an LLM coding agent to a CI pipeline (build, tests, linter, type checker) so the agent receives real pass/fail signals instead of guessing whether its code is correct.
The core loop: agent writes code, CI runs, failures feed back as structured text, agent fixes the specific failures, CI runs again. Stop when green or after 3-5 iterations.
CI output is deterministic ground truth. pytest either passes or it doesn't. This replaces speculative self-critique ("does this look right?") with verifiable feedback ("line 42: NameError: name 'user_id' is not defined").
Structured feedback parsing is critical. Raw CI logs are noisy. Extract the failing test name, the error message, and the file/line location. Feed only the relevant failures to the agent, not 500 lines of build output.
Set a hard iteration cap (3-5 cycles). Without a cap, the agent can enter infinite fix-break loops where fixing one test breaks another, burning tokens and compute indefinitely.
Limitation: CI feedback tells the agent what failed, not why. The agent still needs reasoning capability to diagnose root causes.
Fast CI (under 60 seconds) is essential. Slow pipelines break the tight feedback loop and inflate cost per iteration.

The Problem It Solves

Your coding agent generates a Python function that parses CSV files. It looks correct. The variable names make sense. The logic flows reasonably. The agent even explains its reasoning. But when you run the code, it crashes: the csv.reader call uses the wrong name for the delimiter parameter. The agent wrote separator=',' instead of delimiter=','.

No amount of LLM self-reflection would have caught this. The agent can re-read its own code a hundred times, and it will still think separator is a valid parameter. The knowledge isn't in the model's weights for this specific API. The reliable way to catch it is to actually run the code and observe the TypeError reporting that 'separator' is not a valid keyword argument (the exact wording of the message depends on the Python version).
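The failure is easy to reproduce with the standard library alone. A minimal sketch: try_parse is a helper name made up for this illustration, and because the wording of the error message differs between Python versions, the example relies only on the exception type.

```python
import csv
import io


def try_parse(rows_text: str, **reader_kwargs) -> str:
    """Parse CSV text and report either the parsed rows or the TypeError raised."""
    try:
        rows = list(csv.reader(io.StringIO(rows_text), **reader_kwargs))
        return f"ok: {rows}"
    except TypeError as exc:
        # csv.reader rejects keyword arguments it does not know about
        return f"TypeError: {exc}"


print(try_parse("a,b\n1,2\n", delimiter=","))   # valid parameter name
print(try_parse("a,b\n1,2\n", separator=","))   # plausible but wrong name
```

Reading the second call gives no hint that it is wrong. Running it does.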

This is the fundamental problem: LLMs generate plausible code, not correct code. The difference between plausible and correct is only detectable at runtime. Without a runtime feedback signal, the agent is flying blind. Every iteration of "let me re-examine my code" is the agent staring at the same wrong answer and confirming it looks fine.

Human code review catches some of these issues, but it's slow (hours to days), expensive (senior engineer time), and scales linearly with agent output volume. CI catches them in seconds, scales to thousands of runs per day, and never gets tired or misses a known test case.

What Is It?

The coding agent CI feedback loop connects an LLM coding agent to a continuous integration pipeline so the agent can submit code, receive real execution results (build output, test results, lint errors, type check failures), and use those results to fix its own code autonomously. The agent treats CI as an oracle: an authoritative source of truth about code correctness.

Think of it as a musician practicing with a tuner instead of "by ear." Playing by ear, you might think a note sounds right because your internal reference is slightly off. A tuner gives you objective, measurable feedback: you're 15 cents sharp. You adjust. You check again. Eventually, you're in tune. The tuner doesn't tell you how to adjust your fingering, but it tells you exactly what's wrong, which is often enough.

The loop terminates in two ways: all CI checks pass (success), or the iteration count reaches the cap (failure with partial results). Both outcomes are valid. A coding agent that fixes 3 of 4 test failures in 3 iterations and returns with a clear description of the remaining failure is still useful.

How It Works

The feedback loop architecture

The architecture has four components: the agent, a code submission layer, the CI pipeline, and a feedback parser. The agent writes code. The submission layer commits it to a branch and triggers CI. The CI pipeline runs build, test, and lint stages. The feedback parser extracts structured failure information from CI output and feeds it back to the agent.
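Before looking at the loop itself, it can help to pin the components down as interfaces. The following is a sketch under assumptions: the attribute names mirror the ones used in this article's snippets, but the Failure and CIResult dataclasses and the Agent and CIRunner protocols are hypothetical shapes, not a published API.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Failure:
    """One structured CI failure (hypothetical shape)."""
    stage: str                      # "build", "test", "lint", or "typecheck"
    file_path: str
    line_number: int
    error_type: str                 # e.g. "NameError"
    message: str
    test_name: Optional[str] = None
    expected_vs_actual: Optional[dict] = None


@dataclass
class CIResult:
    """Outcome of one CI run; passing means no failures were reported."""
    failures: list[Failure] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.failures


class Agent(Protocol):
    async def generate(self, task: str) -> str: ...
    async def fix(self, code: str, feedback: str) -> str: ...


class CIRunner(Protocol):
    async def run(self, code: str) -> CIResult: ...
```

Using Protocol classes keeps the loop independent of any particular LLM client or CI backend, which also makes it straightforward to substitute stubs in tests.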

import hashlib


class CIFeedbackLoop:
    def __init__(self, agent, ci_runner, max_iterations=4):
        self.agent = agent
        self.ci = ci_runner
        self.max_iterations = max_iterations
        self.history = []  # One record per CI run, kept for debugging

    async def run(self, task: str) -> dict:
        code = await self.agent.generate(task)
        failures = []

        for i in range(self.max_iterations):
            ci_result = await self.ci.run(code)
            failures = ci_result.failures
            self.history.append({
                "iteration": i + 1,
                # sha256 instead of hash(): str hashes differ between processes
                "code_hash": hashlib.sha256(code.encode()).hexdigest(),
                "ci_passed": ci_result.passed,
                "failures": failures,
            })

            if ci_result.passed:
                return {"status": "success", "code": code,
                        "iterations": i + 1}

            # Skip the fix on the last iteration: CI would never verify it,
            # and remaining_failures would no longer describe the returned code
            if i + 1 < self.max_iterations:
                # Parse failures into structured feedback
                feedback = self.parse_failures(ci_result)
                code = await self.agent.fix(code, feedback)

        return {"status": "max_iterations", "code": code,
                "remaining_failures": failures,
                "iterations": self.max_iterations}

The history list is important for debugging. When something goes wrong, you can trace exactly what the agent produced at each step, what CI reported, and how the agent responded. I've found this history invaluable for diagnosing fix-break cycles where the agent oscillates between two broken states.
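One way to put the history to work is to check whether the agent has resubmitted code it already tried. This is a sketch, assuming each history record carries the "iteration" and "code_hash" keys shown above; find_oscillation is a name introduced here for illustration.

```python
from typing import Optional


def find_oscillation(history: list[dict]) -> Optional[tuple[int, int]]:
    """Return (earlier, later) iteration numbers if identical code was submitted twice."""
    first_seen: dict = {}
    for record in history:
        code_hash = record["code_hash"]
        if code_hash in first_seen:
            return (first_seen[code_hash], record["iteration"])
        first_seen[code_hash] = record["iteration"]
    return None


history = [
    {"iteration": 1, "code_hash": "aaa"},
    {"iteration": 2, "code_hash": "bbb"},
    {"iteration": 3, "code_hash": "aaa"},  # same code as iteration 1
]
print(find_oscillation(history))
```

This only works if the hash is stable and depends on nothing but the code text, so whitespace-only changes will hide a repeat unless the code is normalized first.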

Structured feedback parsing

Raw CI output is hostile to LLMs. A typical pytest failure dump contains 50+ lines of traceback, fixture setup, captured stdout, and comparison diffs. Feeding all of it to the agent wastes tokens, confuses the model, and buries the actual error signal in noise.

The feedback parser extracts only what the agent needs to act: which file, which line, what error, and which test.

def parse_failures(self, ci_result) -> str:
    """Convert raw CI output to structured agent-friendly feedback."""
    structured = []
    for failure in ci_result.failures:
        entry = {
            "stage": failure.stage,     # "test", "lint", "build", "typecheck"
            "file": failure.file_path,
            "line": failure.line_number,
            "error_type": failure.error_type,  # "NameError", "AssertionError"
            "message": failure.message[:200],  # Truncate long messages
            "test_name": failure.test_name,    # "test_parse_csv_with_headers"
        }
        if failure.stage == "test" and failure.expected_vs_actual:
            entry["expected"] = failure.expected_vs_actual["expected"]
            entry["actual"] = failure.expected_vs_actual["actual"]
        structured.append(entry)

    # Format as concise text for the LLM
    lines = [f"CI found {len(structured)} failure(s):\n"]
    for i, f in enumerate(structured, 1):
        lines.append(f"{i}. [{f['stage']}] {f['file']}:{f['line']}")
        if f["test_name"]:  # Build and lint failures have no test name
            lines.append(f"   Test: {f['test_name']}")
        lines.append(f"   {f['error_type']}: {f['message']}")
        if "expected" in f:
            lines.append(f"   Expected: {f['expected']}")
            lines.append(f"   Actual:   {f['actual']}")
    return "\n".join(lines)

The truncation at 200 characters per message is deliberate. Some error messages (especially assertion diffs on large data structures) can be thousands of characters. The agent needs to know what failed and roughly why, not parse a 2000-character diff. For assertion failures, the expected vs. actual values provide the most actionable signal.
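Where the failure objects come from depends on the CI tool. One option is a machine-readable report rather than scraped logs: pytest, for example, can write JUnit-style XML via its --junitxml option. The sketch below parses such a report with the standard library. SAMPLE_REPORT is hand-written for illustration, extract_failures is a hypothetical helper, and the attributes actually present vary by tool and configuration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in JUnit XML style; real reports carry more attributes.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" tests="2" failures="1">
    <testcase classname="tests.test_csv" name="test_parse_ok" />
    <testcase classname="tests.test_csv" name="test_parse_csv_with_headers">
      <failure message="AssertionError: assert 2 == 3">full traceback here</failure>
    </testcase>
  </testsuite>
</testsuites>
"""


def extract_failures(report_xml: str, max_message: int = 200) -> list[dict]:
    """Collect one compact record per failed or errored test case."""
    root = ET.fromstring(report_xml)
    records = []
    for case in root.iter("testcase"):
        for tag in ("failure", "error"):
            node = case.find(tag)
            if node is None:
                continue
            records.append({
                "test_name": case.get("name"),
                "module": case.get("classname"),
                # Keep the short message, drop the full traceback in node.text
                "message": (node.get("message") or "")[:max_message],
            })
    return records


for record in extract_failures(SAMPLE_REPORT):
    print(record)
```

The traceback stored in the element text is deliberately ignored here. If the agent turns out to need more context, a few trailing lines of it can be added back selectively.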

Feedback granularity: what to include

Not all CI output is equally useful. Here's the priority ranking:

CI Stage	Feedback Value	What to Include	What to Exclude
Build/compile errors	Highest	File, line, error message	Full stack trace, build system logs
Test failures	High	Test name, assertion error, expected vs actual	Fixture setup, captured stdout, warnings
Type check errors	High	File, line, type mismatch description	Config file paths, import resolution logs
Lint violations	Medium	Rule name, file, line, auto-fix suggestion	Lint config details, summary statistics
Coverage drops	Low	Uncovered file/function names	Line-by-line coverage diffs

Build errors come first because they block everything else. If the code doesn't compile, test failures are irrelevant noise. Feed build errors to the agent before test errors. I've seen agents waste iterations fixing test assertions when the real problem was a syntax error on line 3 that made the entire file unparseable.
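The ordering rule can be enforced before the feedback text is built. A minimal sketch, assuming failures are dicts with a "stage" key as in the parser above: the numeric tiers in STAGE_PRIORITY are my reading of the table, and select_blocking_failures is a hypothetical helper that passes on only the most urgent tier.

```python
# Lower number = more urgent. Tiers follow the table: build first,
# then tests and type checks together, then lint, then coverage.
STAGE_PRIORITY = {"build": 0, "test": 1, "typecheck": 1, "lint": 2, "coverage": 3}
UNKNOWN_STAGE = len(STAGE_PRIORITY)  # unrecognized stages sort last


def stage_rank(failure: dict) -> int:
    return STAGE_PRIORITY.get(failure["stage"], UNKNOWN_STAGE)


def select_blocking_failures(failures: list[dict]) -> list[dict]:
    """Keep only the failures from the most urgent tier that has any."""
    if not failures:
        return []
    top = min(stage_rank(f) for f in failures)
    return [f for f in failures if stage_rank(f) == top]


failures = [
    {"stage": "test", "message": "assert 2 == 3"},
    {"stage": "build", "message": "SyntaxError: invalid syntax"},
    {"stage": "lint", "message": "unused import"},
]
print(select_blocking_failures(failures))
```

Dropping the lower tiers entirely is a design choice: it costs an extra iteration when the build fix does not also clear the test failures, but it keeps the agent from chasing errors that are only symptoms of the build problem.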

The iteration dynamics

Each iteration should fix at least one failure. If the agent's fix introduces more failures than it resolves, that's a signal the agent is out of its depth on this particular problem. A good feedback loop implementation tracks the failure count per iteration and aborts early if the count increases for two consecutive iterations (the "regression detector").

def should_abort_early(self) -> bool:
    """Abort if the failure count rose in each of the last two iterations."""
    if len(self.history) < 3:
        return False  # Need three CI runs to observe two consecutive increases
    counts = [len(record["failures"]) for record in self.history[-3:]]
    return counts[0] < counts[1] < counts[2]

Connecting to real CI systems

In practice, the CI connection takes one of three forms:

Git-based (production-like). The agent commits to a branch, pushes, and a GitHub Actions/GitLab CI pipeline runs. The feedback loop polls the CI API for results. This is the most realistic but slowest (2-5 minutes per cycle including queue time).
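The polling half of the Git-based setup can be sketched without committing to a specific CI provider. Everything named here is an assumption: fetch_status stands in for whatever call reads the pipeline state from your CI system's API, and the status strings "passed" and "failed" are placeholders for that system's own terminal states.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the pipeline reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in ("passed", "failed"):
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval_s)


# Usage with a stubbed status source instead of a real CI API
statuses = iter(["queued", "running", "failed"])
print(poll_until_done(lambda: next(statuses), sleep=lambda _s: None))
```

A timeout should count against the iteration cap like any other failed cycle. Otherwise a stuck queue can stall the loop indefinitely.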
