Agent circuit breaker
Learn how agent circuit breakers prevent infinite retry loops, token burn, and cascade failures when tools fail, mirroring the HLD circuit breaker at the AI layer.
TL;DR
- When a tool fails repeatedly, agents retry it by default, burning tokens, wasting time, and sometimes looping indefinitely. Agent circuit breakers stop this.
- The pattern mirrors the classic distributed systems circuit breaker: track failure rate per tool, transition through CLOSED → OPEN → HALF-OPEN states.
- In OPEN state, calls to the failing tool are rejected immediately with a fallback response, not forwarded to the actual tool. The agent continues with degraded capability rather than stalling.
- HALF-OPEN lets one test call through to check if the tool has recovered. If it succeeds, the circuit closes. If it fails, wait longer before the next probe.
- Track failure rates per tool instance, not globally. A specific API endpoint being down shouldn't trip the circuit for a different, healthy endpoint of the same tool.
The problem it solves
You have a coding agent that relies on a web search tool. The search provider's API goes down. On each task, the agent calls the search tool, gets a 503 error, tries again, gets another 503, tries once more, hits the retry limit, and then reports failure. This wastes 3-5 tool calls and 30+ seconds per task during the outage.
Worse: if the agent's retry logic doesn't have a hard limit, or if the error handling code re-queues failed tasks, you get infinite loops. The result is an agent loop that can burn $50/hour in API costs and still produce nothing.
The problem compounds in multi-agent systems. If a shared tool fails, every agent that depends on it enters a retry loop simultaneously. The failed tool gets hit with a multiple of its normal traffic, which makes recovery harder.
What is it?
The agent circuit breaker is a state machine wrapped around each tool in an agent's toolkit. It monitors call success/failure rates and automatically transitions between three states:
- CLOSED (normal): calls pass through to the tool. Failure threshold not yet reached.
- OPEN (failing): calls are rejected immediately without reaching the tool. The agent receives a predefined fallback response. A recovery timer is running.
- HALF-OPEN (testing): the timer has expired. One test call is allowed through. Success → CLOSED. Failure → back to OPEN with a longer timeout.
This is identical in structure to the HLD circuit breaker pattern. The only difference is that the "downstream service" is an LLM tool, and the "caller" is an agent loop rather than a microservice.
How it works
State transitions
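The three states and their transitions can be sketched as a small class. This is a minimal illustration, not a reference implementation: the constructor parameters failure_threshold, recovery_timeout, and half_open_max_calls match the configuration shown later in this article, while backoff_factor, max_timeout, clock, and allow_request are assumptions added here. It also counts consecutive failures rather than a true failure rate, which is the simplest variant.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal three-state breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 1,
        backoff_factor: float = 2.0,   # assumption: how much longer to wait after a failed probe
        max_timeout: float = 300.0,    # assumption: upper bound on the backoff
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.base_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.backoff_factor = backoff_factor
        self.max_timeout = max_timeout
        self._clock = clock
        self._state = "CLOSED"
        self._failures = 0
        self._timeout = recovery_timeout
        self._opened_at = 0.0
        self._probes = 0

    @property
    def state(self) -> str:
        # OPEN turns into HALF-OPEN lazily, once the recovery timer has expired.
        if self._state == "OPEN" and self._clock() - self._opened_at >= self._timeout:
            self._state = "HALF-OPEN"
            self._probes = 0
        return self._state

    def allow_request(self) -> bool:
        """Return True if a call may reach the real tool right now."""
        state = self.state
        if state == "CLOSED":
            return True
        if state == "HALF-OPEN" and self._probes < self.half_open_max_calls:
            self._probes += 1
            return True
        return False

    def record_success(self) -> None:
        # A success closes the circuit and resets the counter and the timeout.
        self._state = "CLOSED"
        self._failures = 0
        self._timeout = self.base_timeout

    def record_failure(self) -> None:
        if self.state == "HALF-OPEN":
            # Failed probe: reopen and wait longer before the next one.
            self._timeout = min(self._timeout * self.backoff_factor, self.max_timeout)
            self._open()
            return
        self._failures += 1
        if self._state == "CLOSED" and self._failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self._state = "OPEN"
        self._opened_at = self._clock()
```

Checking the timer lazily inside the state property avoids a background thread: the transition to HALF-OPEN happens the next time anyone asks for the state. A production breaker would also need locking if several agent threads share it.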
Configuration per tool
Each tool gets its own circuit breaker configuration. Tune the thresholds based on the tool's expected availability and criticality:
circuit_breakers = {
    "web_search": CircuitBreaker(
        failure_threshold=5,
        recovery_timeout=30,  # seconds in OPEN before HALF-OPEN
        half_open_max_calls=1,
    ),
    "code_executor": CircuitBreaker(
        failure_threshold=3,
        recovery_timeout=60,
        half_open_max_calls=1,
    ),
    "memory_store": CircuitBreaker(
        failure_threshold=2,  # less tolerant, memory is critical path
        recovery_timeout=15,
        half_open_max_calls=1,
    ),
}
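The configuration above keys each breaker by tool name. To track failures per tool instance, as the TL;DR recommends, one option is a registry keyed by a (tool, instance) pair that creates breakers on demand. The sketch below is illustrative only: BreakerRegistry, BreakerStub, and the endpoint names are hypothetical, and the stub merely counts failures so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class BreakerStub:
    """Stand-in for a real circuit breaker; it only counts failures."""
    failure_threshold: int
    failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record_failure(self) -> None:
        self.failures += 1


class BreakerRegistry:
    """Lazily creates one breaker per (tool, instance) pair."""

    def __init__(self, thresholds: dict[str, int], default_threshold: int = 5) -> None:
        self._thresholds = thresholds
        self._default = default_threshold
        self._breakers: dict[tuple[str, str], BreakerStub] = {}

    def get(self, tool_name: str, instance: str) -> BreakerStub:
        key = (tool_name, instance)
        if key not in self._breakers:
            # Thresholds are configured per tool; state is tracked per instance.
            threshold = self._thresholds.get(tool_name, self._default)
            self._breakers[key] = BreakerStub(failure_threshold=threshold)
        return self._breakers[key]


registry = BreakerRegistry({"web_search": 5, "memory_store": 2})

# Hypothetical endpoints: one provider is down, the other is healthy.
for _ in range(5):
    registry.get("web_search", "provider-a.example").record_failure()

print(registry.get("web_search", "provider-a.example").is_open)  # True
print(registry.get("web_search", "provider-b.example").is_open)  # False
```

Thresholds stay per tool, so tuning remains simple, while each endpoint trips independently.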
Agent behavior in OPEN state
When the circuit is open, the tool call returns a fallback response immediately without hitting the actual tool. The agent must handle this gracefully:
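def call_tool(tool_name: str, args: dict) -> ToolResult:
    # ToolResult, tools, and circuit_breakers are assumed to be defined elsewhere.
    cb = circuit_breakers[tool_name]
    if cb.state == "OPEN":
        # Fail fast: return the fallback without touching the real tool.
        return ToolResult(
            success=False,
            error=f"{tool_name} is currently unavailable (circuit open). Proceeding without this tool.",
            data=None,
        )
    try:
        result = tools[tool_name](**args)
    except Exception:
        # Count the failure, then re-raise so the caller's error handling still runs.
        cb.record_failure()
        raise
    cb.record_success()
    return result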
The agent's system prompt or task handler should be designed to handle "tool unavailable" responses gracefully: either skipping that step, using an alternative tool, or surfacing a partial result to the user.
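One way to implement the "alternative tool, then partial result" behavior in a task handler is an ordered fallback chain. The sketch below is an assumption about how such a handler might look: run_step, fake_call, and the cached_search tool are hypothetical, and ToolResult is redefined here with the same three fields used in call_tool so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    success: bool
    data: Any = None
    error: Optional[str] = None


def run_step(candidates: list[str], call: Callable[[str], ToolResult]) -> ToolResult:
    """Try each candidate tool in order; degrade instead of failing hard."""
    errors = []
    for name in candidates:
        result = call(name)
        if result.success:
            return result
        errors.append(f"{name}: {result.error}")
    # Nothing worked: return a failure the agent can surface as a partial result.
    return ToolResult(success=False, data=None, error="; ".join(errors))


def fake_call(name: str) -> ToolResult:
    # Simulates web_search being rejected by an open circuit.
    if name == "web_search":
        return ToolResult(success=False, error="circuit open")
    return ToolResult(success=True, data=f"answer from {name}")


result = run_step(["web_search", "cached_search"], fake_call)
print(result.data)  # answer from cached_search
```

Because an open circuit returns instantly, walking the chain costs almost nothing during an outage.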
Metrics to track
Per-circuit-breaker, collect:
- State transitions (CLOSED → OPEN events). Alert on these; they indicate real tool outages.
- Rejection count in OPEN state. Tells you how many calls were saved from hitting a broken tool.
- Recovery time. Measures how long the circuit was open before returning to CLOSED.
These metrics are the difference between debugging "the agent is slow" and instantly identifying "the web_search circuit has been open for 15 minutes."
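A minimal in-memory collector for these three metrics might look like the following. The class name BreakerMetrics and its on_transition and on_rejection hooks are hypothetical; a real breaker would call them from its own state changes and export the values to whatever monitoring system is in use.

```python
import time
from collections import Counter
from typing import Callable, Optional


class BreakerMetrics:
    """Collects transitions, rejections, and recovery time for one breaker."""

    def __init__(self, tool_name: str, clock: Callable[[], float] = time.monotonic) -> None:
        self.tool_name = tool_name
        self._clock = clock
        self.transitions: Counter = Counter()   # (old_state, new_state) -> count
        self.rejections = 0                     # calls refused while OPEN
        self.recovery_times: list[float] = []   # seconds from first OPEN back to CLOSED
        self._opened_at: Optional[float] = None

    def on_transition(self, old: str, new: str) -> None:
        self.transitions[(old, new)] += 1
        if old == "CLOSED" and new == "OPEN":
            # Start of an outage; failed probes that reopen do not reset this.
            self._opened_at = self._clock()
        elif new == "CLOSED" and self._opened_at is not None:
            self.recovery_times.append(self._clock() - self._opened_at)
            self._opened_at = None

    def on_rejection(self) -> None:
        self.rejections += 1

    def open_for(self) -> Optional[float]:
        """Seconds the circuit has been open so far, or None if it is closed."""
        if self._opened_at is None:
            return None
        return self._clock() - self._opened_at
```

Recovery time is measured from the first CLOSED → OPEN transition, so an outage that spans several failed probes is reported as one interval.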
With vs. without a circuit breaker
The following sequence shows how a circuit breaker changes agent behavior when a tool goes down. Without it, every call hits the failing API. With it, calls are rejected instantly once the circuit opens.
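The difference can be made concrete with a toy simulation that counts how many calls actually reach a tool that stays down for the whole run. The function name and the numbers are illustrative assumptions, and the recovery timer is deliberately left out, so no HALF-OPEN probes are counted.

```python
from typing import Optional


def calls_reaching_api(
    tasks: int,
    retries_per_task: int,
    failure_threshold: Optional[int] = None,
) -> int:
    """Count calls that reach a tool that is down for the whole run.

    failure_threshold=None models an agent with no circuit breaker.
    """
    reached = 0
    consecutive_failures = 0
    for _ in range(tasks):
        for _ in range(retries_per_task):
            circuit_open = (
                failure_threshold is not None
                and consecutive_failures >= failure_threshold
            )
            if circuit_open:
                break  # rejected instantly; the agent moves on with a fallback
            reached += 1               # the call goes out and comes back as a 503
            consecutive_failures += 1
    return reached


print(calls_reaching_api(tasks=20, retries_per_task=3))                       # 60
print(calls_reaching_api(tasks=20, retries_per_task=3, failure_threshold=5))  # 5
```

With a real recovery timer, one extra probe call would reach the tool each time the timer expires, which is still far fewer calls than retrying on every task.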