Design an autonomous coding agent

TL;DR

Multi-strategy code retrieval (AST-aware chunking + dependency graph traversal + semantic search over embeddings + file path matching) assembles a focused context of 10-20 relevant files per task, achieving 2-3x better task completion on SWE-bench than naive keyword search.
Iterative plan-then-execute with plan revision (the ReAct loop applied to coding) solves 27% of SWE-bench Lite issues versus 12% for single-shot generation. The agent plans, writes code, observes test results, and revises the plan for up to 10 iterations.
Layered sandboxing (ephemeral Docker containers with no network, seccomp profiles, writable overlays, defined tool interfaces) prevents catastrophic failures like rm -rf / or credential theft from agent-generated code.
Structured test-feedback parsing (extract specific assertion errors, retrieve the failing test and tested function, generate a diagnosis before a fix) reduces average fix iterations from 4.1 to 2.3 rounds, saving 50-80% of CI compute.
The production lesson: the agent that writes the code is easy. The agent that knows when to stop, when to ask for help, and when its own plan is wrong is the hard part. Termination logic is what separates a demo from a product.
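
The first bullet does not say how results from the four retrieval strategies are merged into one context. One common option, swapped in here purely as an illustration, is reciprocal rank fusion: each strategy contributes 1/(k + rank) for every file it returns, and files are sorted by the summed score. The strategy names, the sample file lists, and k = 60 are assumptions rather than details of this design.

```python
from collections import defaultdict


def fuse_rankings(rankings: dict[str, list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Merge per-strategy ranked file lists with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_files in rankings.values():
        for rank, path in enumerate(ranked_files, start=1):
            scores[path] += 1.0 / (k + rank)
    # Highest combined score first; ties are broken by path so output is deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical per-strategy results for the pagination task used later in this article.
example = {
    "semantic": ["src/routes/users.ts", "src/db/queries/users.ts", "src/routes/orders.ts"],
    "dependency_graph": ["src/db/queries/users.ts", "src/db/pool.ts"],
    "path_match": ["src/routes/users.ts", "tests/routes/users.test.ts"],
}
context_files = fuse_rankings(example)
```

Files that several strategies agree on rise to the top, which is the property a focused 10-20 file context needs; the top_n cap mirrors that upper bound.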

Requirements

Functional requirements

The system accepts a task description (GitHub issue, Jira ticket, or plain text) and produces a working code change that addresses the task.
The agent indexes and understands the target codebase, retrieving relevant files, functions, and dependencies for context assembly.
The agent generates a step-by-step plan for the code change, executes each step, and revises the plan based on intermediate results.
The agent runs the existing test suite (plus any new tests it writes) in a sandboxed environment and iterates on failures until tests pass.
The agent opens a pull request with the code changes, a summary of what was changed and why, and links back to the original task.
A human review gate prevents any agent-generated code from merging without explicit human approval.

Non-functional requirements

Task completion rate: resolve 30%+ of well-scoped GitHub issues autonomously (SWE-bench Lite baseline).
End-to-end latency: 5-30 minutes per task (versus 2-8 hours for a human developer on comparable tasks).
Cost per task: $0.50-$5.00 at GPT-4o pricing ($2.50/1M input, $10/1M output tokens), with 50K-200K tokens consumed per task.
Sandbox startup time: under 5 seconds per task (Docker container creation and codebase mount).
Security: zero agent-initiated network calls outside the sandbox, zero access to production credentials, all file changes captured as diffs (never applied directly to the repo).
Availability: 99.5% uptime for the agent service. Failures result in the task returning to the human queue, never silent data loss.
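
The per-token prices in the cost requirement translate into a token-spend estimate as sketched below. The function name and the 160K/40K input/output split are assumptions chosen for illustration, and the helper models token spend only.

```python
INPUT_USD_PER_M = 2.50    # input price quoted in the cost requirement
OUTPUT_USD_PER_M = 10.00  # output price quoted in the cost requirement


def token_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token spend for one task at the quoted per-million-token prices."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical split: 160K input tokens (retrieved context) and 40K output tokens (plans, diffs).
estimate = token_cost_usd(160_000, 40_000)
```

Because output tokens cost four times as much as input tokens, the input/output split moves the estimate as much as the total token count does.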

The hardest engineering problem here: the agent must know when to stop. A coding agent that loops endlessly on a failing test wastes compute dollars. An agent that gives up too early misses solvable problems. The termination policy (max iterations, loop detection, confidence thresholds) is the make-or-break calibration, and it depends on signals from the planner, the test runner, and the diff analyzer simultaneously.
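
A minimal sketch of such a termination policy follows, combining the three signals named above. The field names, the thresholds (0.3 confidence, three stalled rounds), and the reason strings are all assumptions; a real policy would be tuned against logged runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IterationSignal:
    """Signals from one loop iteration (all field names are illustrative)."""
    failing_tests: frozenset[str]  # from the test runner
    diff_hash: str                 # from the diff analyzer
    planner_confidence: float      # from the planner, 0.0-1.0


def stop_reason(history: list[IterationSignal], max_iterations: int = 5,
                min_confidence: float = 0.3) -> Optional[str]:
    """Return why the agent should stop, or None to keep iterating."""
    if not history:
        return None
    latest = history[-1]
    if not latest.failing_tests:
        return "success"
    if len(history) >= max_iterations:
        return "max_iterations"
    # Loop detection, part 1: the agent regenerated a diff it already tried.
    if any(earlier.diff_hash == latest.diff_hash for earlier in history[:-1]):
        return "repeated_diff"
    # Loop detection, part 2: the same tests have failed three rounds in a row.
    if len(history) >= 3 and all(s.failing_tests == latest.failing_tests for s in history[-3:]):
        return "no_progress"
    if latest.planner_confidence < min_confidence:
        return "low_confidence"
    return None
```

Any non-success reason would send the task back to the human queue rather than spending another iteration.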

The core entities

CodingTask

task_id, source (github_issue, jira, manual), source_url, title, description, repo_id, branch, status (queued, indexing, planning, coding, testing, pr_opened, failed, completed, cancelled), created_at, completed_at, total_tokens_used, total_cost_usd
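
The status field behaves like a small state machine. The sketch below encodes one plausible set of transitions; the edges themselves are assumptions (the source lists the states, not the allowed moves), and cancellation, which the cancel endpoint reports, is omitted for brevity.

```python
# Assumed transitions between CodingTask statuses; only the state names come from the entity.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "queued": {"indexing", "planning", "failed"},
    "indexing": {"planning", "failed"},
    "planning": {"coding", "failed"},
    "coding": {"testing", "failed"},
    "testing": {"planning", "pr_opened", "failed"},  # back to planning when tests fail
    "pr_opened": {"completed", "failed"},
    "failed": set(),
    "completed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

Rejecting illegal moves at write time keeps a crashed worker from, say, marking a task completed without ever opening a pull request.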

CodebaseIndex

index_id, repo_id, commit_sha, file_count, chunk_count, embedding_model, index_size_mb, indexed_at, languages_detected, dependency_graph_nodes

AgentPlan

plan_id, task_id, version (increments on revision), steps (ordered list of planned file changes), reasoning, estimated_files_changed, created_at, revised_from (nullable, previous plan_id)
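
The version and revised_from fields imply that plans are revised by creating a new record rather than editing the old one. A minimal sketch under that assumption follows; the ID format and the revise helper are hypothetical, and only a subset of the fields is modeled.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AgentPlan:
    task_id: str
    steps: tuple[str, ...]
    reasoning: str
    version: int = 1
    revised_from: Optional[str] = None  # previous plan_id, None for the first plan
    plan_id: str = field(default_factory=lambda: f"plan_{uuid.uuid4().hex[:8]}")


def revise(plan: AgentPlan, new_steps: list[str], reasoning: str) -> AgentPlan:
    """Create the next plan version, linked back to the plan it replaces."""
    return AgentPlan(
        task_id=plan.task_id,
        steps=tuple(new_steps),
        reasoning=reasoning,
        version=plan.version + 1,
        revised_from=plan.plan_id,
    )


v1 = AgentPlan("task_abc123", ("Add cursor parameter parsing",), "initial plan")
v2 = revise(v1, ["Add cursor parameter parsing", "Encode cursor as base64"], "test expected base64")
```

Keeping old versions immutable leaves a full revision chain for debugging why the agent changed course.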

CodeChange

change_id, task_id, plan_step, file_path, change_type (create, modify, delete), diff, tokens_used, model_used, created_at

TestExecution

execution_id, task_id, iteration, test_command, exit_code, stdout, stderr, tests_passed, tests_failed, tests_errored, duration_ms, executed_at
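
Raw stdout is what gets stored, but the diagnosis step needs structured failures. The sketch below extracts them from pytest-style short summary lines of the form `FAILED file::test - message`; that output format is an assumption about the test runner, and other runners (for example a JavaScript one for the TypeScript repo in the API examples) would need their own parser.

```python
import re
from dataclasses import dataclass

# Matches lines such as: FAILED tests/test_users.py::test_cursor - AssertionError: ...
FAILED_LINE = re.compile(
    r"^FAILED (?P<file>[^\s:]+)::(?P<test>\S+) - (?P<message>.+)$", re.MULTILINE
)


@dataclass
class Failure:
    file: str
    test: str
    message: str


def parse_failures(stdout: str) -> list[Failure]:
    """Extract one Failure per FAILED summary line in the test output."""
    return [Failure(**m.groupdict()) for m in FAILED_LINE.finditer(stdout)]


sample = """\
=========================== short test summary info ===========================
FAILED tests/test_users.py::test_cursor_format - AssertionError: expected base64 cursor, got raw ID
========================= 1 failed, 14 passed in 2.31s =========================
"""
failures = parse_failures(sample)
```

Each Failure gives the agent a file and test name to retrieve, so the next prompt can contain the failing test and the function under test instead of the whole log.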

SandboxSession

session_id, task_id, container_id, image_tag, cpu_limit, memory_limit_mb, timeout_seconds, status (running, completed, timed_out, killed), created_at, destroyed_at
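
The SandboxSession limits map fairly directly onto `docker run` flags. The sketch below only builds the argument list; the flags are standard Docker options, while the PID limit, the mount layout, the seccomp profile path, and the idea that `repo_dir` is a throwaway copy of the checkout are assumptions.

```python
def docker_run_args(image_tag: str, repo_dir: str, cpu_limit: float,
                    memory_limit_mb: int, seccomp_profile: str,
                    command: list[str]) -> list[str]:
    """Build a locked-down `docker run` invocation for one sandbox session."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                      # no network access from inside the sandbox
        "--cpus", str(cpu_limit),
        "--memory", f"{memory_limit_mb}m",
        "--pids-limit", "256",                    # assumed cap against fork bombs
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={seccomp_profile}",
        "--read-only",                            # root filesystem is read-only
        "--tmpfs", "/tmp",                        # scratch space that vanishes with the container
        "--volume", f"{repo_dir}:/workspace",     # repo_dir: disposable copy, never the real repo
        "--workdir", "/workspace",
        image_tag,
        *command,
    ]


args = docker_run_args("agent-sandbox:latest", "/tmp/task_abc123/repo", 2.0, 4096,
                       "/etc/agent/seccomp.json", ["npm", "test"])
```

The timeout_seconds field is not a `docker run` flag; in this sketch the caller would be responsible for enforcing it and killing the container when it expires.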

PullRequest

pr_id, task_id, repo_id, branch_name, title, body, files_changed, insertions, deletions, review_status (pending, approved, changes_requested, merged), created_at

API design

POST /v1/tasks - submit a new coding task

Request: {
  "source": "github_issue",
  "source_url": "https://github.com/acme/api-server/issues/342",
  "repo_id": "repo_acme_api",
  "branch": "main",
  "description": "Add pagination to the /users endpoint. Should support cursor-based pagination with a default page size of 20.",
  "max_iterations": 5,
  "sandbox_config": {
    "timeout_seconds": 1800,
    "memory_limit_mb": 4096
  }
}
Response: {
  "task_id": "task_abc123",
  "status": "queued",
  "estimated_start": "2026-04-11T14:02:00Z",
  "queue_position": 3
}
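
Before a task is queued, the request body would typically be validated. A minimal sketch follows; which fields are required and the 1-10 bound on max_iterations are assumptions (the ceiling echoes the 10-iteration figure in the TL;DR), while the allowed sources come from the CodingTask entity.

```python
ALLOWED_SOURCES = {"github_issue", "jira", "manual"}  # from CodingTask.source
REQUIRED_FIELDS = ("source", "repo_id", "branch", "description")  # assumed required set


def validate_task_request(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors = [f"missing field: {key}" for key in REQUIRED_FIELDS if not payload.get(key)]
    source = payload.get("source")
    if source and source not in ALLOWED_SOURCES:
        errors.append(f"unknown source: {source}")
    max_iterations = payload.get("max_iterations", 5)
    if not isinstance(max_iterations, int) or not 1 <= max_iterations <= 10:
        errors.append("max_iterations must be an integer between 1 and 10")
    return errors


request = {
    "source": "github_issue",
    "source_url": "https://github.com/acme/api-server/issues/342",
    "repo_id": "repo_acme_api",
    "branch": "main",
    "description": "Add pagination to the /users endpoint.",
    "max_iterations": 5,
}
errors = validate_task_request(request)
```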

GET /v1/tasks/{task_id} - check task status and progress

Response: {
  "task_id": "task_abc123",
  "status": "testing",
  "current_iteration": 2,
  "plan": {
    "version": 2,
    "steps": [
      { "step": 1, "file": "src/routes/users.ts", "action": "Add cursor parameter parsing", "status": "done" },
      { "step": 2, "file": "src/db/queries/users.ts", "action": "Implement cursor-based query", "status": "done" },
      { "step": 3, "file": "tests/routes/users.test.ts", "action": "Add pagination tests", "status": "in_progress" }
    ]
  },
  "test_results": {
    "iteration": 2,
    "passed": 14,
    "failed": 1,
    "error_summary": "Expected cursor token format to be base64-encoded, got raw ID"
  },
  "tokens_used": 87420,
  "cost_usd": 1.23
}

POST /v1/tasks/{task_id}/cancel - cancel a running task

Response: {
  "task_id": "task_abc123",
  "status": "cancelled",
  "sandbox_destroyed": true,
  "partial_pr": null
}

GET /v1/tasks/{task_id}/pr - get the generated pull request

Response: {
  "pr_id": "pr_xyz789",
  "url": "https://github.com/acme/api-server/pull/87",
  "title": "feat: Add cursor-based pagination to /users endpoint",
  "files_changed": 3,
  "insertions": 142,
  "deletions": 12,
  "summary": "Implemented cursor-based pagination for the /users endpoint using base64-encoded cursor tokens...",
  "review_status": "pending"
}

POST /v1/repos/{repo_id}/index - trigger codebase indexing

Request: {
  "commit_sha": "a1b2c3d",
  "force_reindex": false
}
Response: {
  "index_id": "idx_def456",
  "status": "indexing",
  "estimated_duration_seconds": 180,
  "file_count": 2340
}
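
How the indexing endpoint decides whether work is needed is not spelled out. One plausible reading of force_reindex is sketched below: reuse the existing index when it was built from the requested commit, and rebuild otherwise. The function name and the prefix matching for abbreviated SHAs are assumptions.

```python
from typing import Optional


def needs_reindex(indexed_sha: Optional[str], requested_sha: str,
                  force_reindex: bool = False) -> bool:
    """Decide whether the codebase index must be rebuilt for the requested commit."""
    if force_reindex or indexed_sha is None:
        return True
    # Abbreviated SHAs such as "a1b2c3d" are compared by prefix against the full SHA.
    same_commit = indexed_sha.startswith(requested_sha) or requested_sha.startswith(indexed_sha)
    return not same_commit
```

Skipping the rebuild when the commit is unchanged keeps repeated tasks against the same branch head from paying the indexing cost each time.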

High-level design

The system has two distinct pipelines: an offline indexing pipeline that processes the codebase into searchable chunks, and an online execution pipeline that runs when a task arrives. The offline pipeline runs once per commit (or on-demand) and takes 5-15 minutes for a large repo. The online pipeline is the agent loop itself: plan, code, test, iterate.
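
To make the chunking step of the offline pipeline concrete, here is a minimal sketch of AST-aware chunking. It uses Python's built-in ast module on Python source as a stand-in; a production indexer would need a parser per language, and the Chunk fields are assumptions.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    name: str
    kind: str        # "function" or "class"
    start_line: int  # 1-indexed, inclusive
    end_line: int
    text: str


def chunk_python_source(path: str, source: str) -> list[Chunk]:
    """Split a module into one chunk per top-level function or class."""
    lines = source.splitlines()
    chunks: list[Chunk] = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # Start at the first decorator, if any, so decorators stay with their definition.
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            text = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append(Chunk(path, node.name, kind, start, node.end_lineno, text))
    return chunks


sample = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
chunks = chunk_python_source("example.py", sample)
```

Cutting on definition boundaries means each embedded chunk is a complete function or class, rather than an arbitrary window that splits a body in half.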

I like to think of the architecture as three layers. The intake layer receives tasks from GitHub webhooks or API calls and queues them. The brain layer (the LLM planner and code generator) orchestrates the work. The execution layer (the sandbox) runs the agent's code and tests in complete isolation. These layers communicate through a task queue and a shared state store, so any component can be scaled or restarted independently.

For your interview: draw the offline indexing pipeline first, then the online execution loop. This demonstrates that you understand prep work (indexing) is separate from real-time work (coding). Candidates who jump straight to "the LLM writes code" miss half the system.

The key feedback loop is the cycle at the center of the design: TestRunner to Diagnoser to Planner to CodeGen, then back to the Docker sandbox. This loop runs up to max_iterations times (5 in the example request above) before the agent either succeeds or escalates. Every iteration costs roughly $0.30-0.80 in tokens, so a 5-iteration task caps at around $4.00 (5 x $0.80).
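
The loop and its cost cap can be sketched as follows. The two callables stand in for the real components (`attempt` for diagnose, replan, and regenerate; `run_tests` for the sandboxed test run), and their signatures, the result dictionary, and the explicit budget check are assumptions.

```python
from typing import Callable


def run_feedback_loop(attempt: Callable[[list[str]], float],
                      run_tests: Callable[[], list[str]],
                      max_iterations: int = 5,
                      budget_usd: float = 4.00) -> dict:
    """Iterate fix-then-test until tests pass, iterations run out, or the budget is spent.

    attempt(previous_failures) performs one round of changes and returns its token cost in USD.
    run_tests() returns the names of failing tests; an empty list means everything passed.
    """
    spent = 0.0
    iteration = 0
    failures: list[str] = []
    for iteration in range(1, max_iterations + 1):
        spent += attempt(failures)
        failures = run_tests()
        if not failures:
            return {"outcome": "success", "iterations": iteration, "cost_usd": round(spent, 2)}
        if spent >= budget_usd:
            break
    # Out of iterations or budget: hand the task back to a human.
    return {"outcome": "escalated", "iterations": iteration, "cost_usd": round(spent, 2)}


# Worst case from the text: every iteration costs $0.80 and the test never passes.
worst_case = run_feedback_loop(lambda failures: 0.80, lambda: ["test_cursor_format"])
```

With five iterations at $0.80 each, the worst case escalates after spending $4.00, matching the cap described above.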
