πŸ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Patterns

Circuit breaker pattern

Learn how the circuit breaker pattern stops cascading failures by failing fast on broken dependencies, and how three states protect your system.

46 min read · 2026-03-25 · medium · Tags: circuit-breaker, resilience, microservices, fault-tolerance, hld

TL;DR

  • A circuit breaker wraps outbound service calls, monitors their failure rate, and trips when failures exceed a threshold, returning instant errors instead of waiting for 30-second timeouts to pile up.
  • Three states do all the work: CLOSED (normal traffic), OPEN (all calls blocked, fail-fast), and HALF-OPEN (one probe request to test recovery before reopening).
  • The core trade-off is fast degradation vs transparency: users see immediate errors during OPEN state rather than timeouts, which protects your thread pool and every other service in the mesh, at the cost of temporary feature unavailability.
  • Without a circuit breaker, one failing downstream service can exhaust every thread in your caller within seconds, crashing services that have nothing to do with the broken dependency.
  • Pair with timeout (bound each call), retry (transient blips), and bulkhead (resource isolation); all four together form a complete resilience envelope.

The Problem

It's 2 a.m. and your on-call phone rings. Your e-commerce platform is returning 503s across the board. You open dashboards and see something strange: CPU on every app server is near zero, error rates on the Inventory Service and User Service are normal, but the Order Service is on fire.

The Order Service calls the Payment Service. The Payment Service's database cluster had a failover 6 minutes ago. During failover, every query to the payments DB hangs for 30 seconds before timing out. The Order Service fires a payment call on every checkout request. Each call ties up a thread for 30 seconds. With 400 concurrent checkouts, you need 400 threads, but your thread pool has 200. Threads queue. The queue fills. New requests start timing out at the API gateway before they even reach the Order Service. Then the Cart Service, which calls the Order Service to validate stock, starts failing too.

The Payments database recovered at 2:03 a.m. Your platform didn't recover until 2:17 a.m., because every upstream service was still recovering from its own thread-pool exhaustion, and the queue of retrying requests launched a new wave of 30-second timeouts as it drained.

Four boxes in a row: User, Order Service, Inventory Service, Payment Service. Arrows between them show 30-second hanging calls. Annotation text under each service shows thread pools filling up. Caption states the failure propagates upstream.
Without a circuit breaker, each service holds threads waiting for the next. A failure in Payments exhausts Order Service's thread pool, then Cart Service's, then the API gateway's, all within seconds.

The fix isn't more app servers. The fix is a mechanism that detects when the Payment Service is broken and stops trying, immediately and for every caller, so threads are freed and every other feature keeps running.


One-Line Definition

A circuit breaker wraps outbound calls to detect failure patterns and interrupts the circuit once failures exceed a threshold, returning instant errors until the downstream dependency proves it has recovered.


Analogy

Think about the circuit breaker panel in your home's electrical box. When a fault causes excess current on a circuit (say, a short in the kitchen), the breaker for that circuit trips instantly. Every appliance on that circuit goes dark. But the rest of your house stays on. The whole house doesn't go dark because of a single kitchen fault.

Now imagine there was no breaker. The excess current would flow through the wiring until something burned. And it wouldn't stop at the kitchen: it would travel back through shared copper to other circuits and take those down too.

Software circuit breakers work the same way. The "excess current" is a flood of threads waiting for a response that never comes. The breaker trips when enough calls fail, cutting the circuit before those threads exhaust the pool and preserving every other feature that shares that pool.


Solution Walkthrough

The circuit breaker sits in the call path between your service and the downstream dependency. It maintains three internal states:

Three state ellipses: CLOSED (green, all requests flow through), OPEN (red, all requests blocked), and HALF-OPEN (yellow, one probe request allowed). Arrows show transition conditions between states.
The circuit breaker transitions through three states: normal operation, full blocking, and a controlled probe phase. Recovery is always tested before full traffic resumes.

CLOSED: normal operation

This is the happy path. All requests pass through to the downstream service. The circuit breaker counts failures in a sliding window, by default tracking the last N requests or the last T seconds.

When failures exceed the configured threshold (e.g., 5 failures in 30 seconds, or 50% error rate over the last 20 requests), the circuit trips to OPEN.

sequenceDiagram
    participant C as ⚙️ Caller
    participant CB as ⚡ Circuit Breaker (CLOSED)
    participant D as 🗄️ Downstream

    C->>CB: call()
    CB->>D: forward request
    D-->>CB: HTTP 200 · 45ms
    CB-->>C: success · failure_count=0

    Note over CB: 4 failures accumulated
    C->>CB: call()
    CB->>D: forward request
    D-->>CB: HTTP 500 · timeout
    CB-->>C: error · failure_count=5
    Note over CB: ⚡ TRIPS TO OPEN

OPEN: fail-fast

In OPEN state, the circuit breaker short-circuits every call; it doesn't even attempt to reach the downstream service. The caller receives an immediate error (< 1ms), freeing the thread instantly.

The circuit stays OPEN for a configured sleep window (typically 30–60 seconds). After that window elapses, it transitions to HALF-OPEN.

sequenceDiagram
    participant C as ⚙️ Caller
    participant CB as ⚡ Circuit Breaker (OPEN)
    participant D as 🗄️ Downstream

    Note over CB: Sleep window: 30s<br/>No calls forwarded
    C->>CB: call()
    CB-->>C: ❌ CircuitOpenException · < 1ms
    Note over C: Thread freed immediately<br/>No timeout wait

    C->>CB: call()
    CB-->>C: ❌ CircuitOpenException · < 1ms

    Note over CB: 30s elapsed → HALF-OPEN

HALF-OPEN: controlled recovery probe

After the sleep window, the circuit transitions to HALF-OPEN. It allows exactly one test request through to the downstream service.

Circuit breaker in HALF-OPEN state with one probe request going through and multiple requests blocked. If probe succeeds, transitions to CLOSED. If fails, returns to OPEN.
HALF-OPEN is deliberate caution: one request proves recovery before the floodgates reopen. This prevents the thundering-herd re-trip that would happen if all queued requests hit the recovering service simultaneously.

If the probe request succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit returns to OPEN and the sleep timer resets. Some implementations require N consecutive successes before closing, which protects against a service that's intermittently recovering.

sequenceDiagram
    participant C as ⚙️ Caller
    participant CB as ⚡ Circuit Breaker (HALF-OPEN)
    participant D as 🗄️ Downstream

    Note over CB: 30s elapsed: probing
    C->>CB: call()
    CB->>D: probe request allowed
    D-->>CB: HTTP 200 · 35ms · recovered
    CB-->>C: success
    Note over CB: ✅ → CLOSED<br/>Normal traffic resumes

    Note over CB,D: OR: probe fails
    C->>CB: call()
    CB->>D: probe request
    D-->>CB: HTTP 500
    CB-->>C: error
    Note over CB: ⚡ → OPEN again<br/>Reset 30s timer

How the Failure Window Works

The threshold isn't measured against all-time calls; it uses a sliding window so the circuit responds to current conditions, not ancient history.

Eight time-slot boxes from t=0s to t=35s. Some are green (OK) and some are red (FAIL). A dashed purple rectangle highlights the most recent 30-second window where 4 failures occur. Caption explains the sliding window discards old events.
The sliding window discards failures that are older than the window. Only current failures count toward the threshold, preventing a bad minute from tripping the circuit hours later.

Two window types exist in practice:

| Window type | How it works | Best for |
|---|---|---|
| Count-based | Track the last N requests. If error rate > X%, trip. | Services with steady, predictable traffic |
| Time-based | Track all requests in the last T seconds. If failures > threshold, trip. | Services with bursty or highly variable traffic |

Count-based windows are simpler to reason about: "trip if 5 of the last 10 requests failed." Time-based windows are more responsive to time-bounded spikes but require careful tuning of both the window duration and failure count.

The minimum request volume guard is a detail most candidates skip: you should never trip a circuit if the window contains fewer than N requests. Without this, 2 failures out of 2 total requests = 100% error rate and an immediate trip, even though those 2 failures might be completely normal cold-start noise. Most production-grade libraries enforce a minimum volume threshold of 5–20 requests before allowing a trip.
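As a sketch, the trip decision with a time-based window and a volume guard might look like this in TypeScript (all names here are illustrative, not from any particular library):

```typescript
// Sketch: time-based failure window with a minimum volume guard.
interface Outcome {
  timestamp: number; // ms when the call completed
  success: boolean;
}

function shouldTrip(
  outcomes: Outcome[],        // recorded call outcomes
  now: number,                // current time in ms
  windowMs: number,           // e.g. 30_000: only look this far back
  minVolume: number,          // e.g. 10: never trip below this many calls
  errorRateThreshold: number  // e.g. 0.5: trip at 50% failures
): boolean {
  // Discard everything older than the window; ancient history doesn't count.
  const recent = outcomes.filter((o) => now - o.timestamp <= windowMs);
  if (recent.length < minVolume) return false; // volume guard: too few samples
  const failures = recent.filter((o) => !o.success).length;
  return failures / recent.length >= errorRateThreshold;
}
```

With this shape, 2 failures out of 2 cold-start calls never trip (the volume guard holds), while 6 failures out of 10 recent calls do.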


Implementation Sketch

Here's a circuit breaker in TypeScript. This is a sketch, not production code: real implementations use library-grade atomic counters for concurrency safety.

// circuit-breaker.ts (SKETCH)
// Limitations: (1) uses consecutive-failure count, not a true sliding window;
// (2) single-threaded probe guard (see probeInFlight), safe for the Node.js event loop
// but not for multi-threaded runtimes. Use Resilience4j or Polly in production.
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

interface CircuitBreakerConfig {
  failureThreshold: number;     // e.g. 5: failures before trip
  successThreshold: number;     // e.g. 2: consecutive successes to close
  timeout: number;              // e.g. 30000: ms to stay OPEN before probe
  volumeThreshold: number;      // e.g. 10: min requests before trip allowed
}

class CircuitBreaker {
  private state: State = "CLOSED";
  private failureCount = 0;
  private successCount = 0;
  private requestCount = 0;
  private openedAt: number | null = null;
  private probeInFlight = false; // ensures only one probe in HALF_OPEN at a time

  constructor(
    private readonly fn: (...args: unknown[]) => Promise<unknown>,
    private readonly config: CircuitBreakerConfig
  ) {}

  async call(...args: unknown[]): Promise<unknown> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt! >= this.config.timeout) {
        this.state = "HALF_OPEN";
        this.probeInFlight = false; // reset for new probe window
      } else {
        throw new Error("CircuitOpenException: downstream unavailable");
      }
    }

    // In HALF_OPEN: allow only one probe. All concurrent callers get an error.
    if (this.state === "HALF_OPEN") {
      if (this.probeInFlight) {
        throw new Error("CircuitOpenException: probe in flight, downstream unavailable");
      }
      this.probeInFlight = true;
    }

    this.requestCount++;

    try {
      const result = await this.fn(...args);
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    if (this.state === "HALF_OPEN") {
      this.probeInFlight = false; // probe returned; gate is open for the next probe
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        this.state = "CLOSED";  // ✅ fully recovered
        this.successCount = 0;
        this.requestCount = 0;
      }
    }
  }

  private onFailure(): void {
    this.probeInFlight = false; // probe failed; allow the next probe attempt after sleep
    this.failureCount++;
    this.successCount = 0;
    if (this.state === "HALF_OPEN") {
      // Any probe failure immediately re-trips and resets the sleep window
      this.state = "OPEN";
      this.openedAt = Date.now();
      return;
    }
    if (
      this.state !== "OPEN" &&
      this.requestCount >= this.config.volumeThreshold &&
      this.failureCount >= this.config.failureThreshold
    ) {
      this.state = "OPEN";    // ⚡ tripped
      this.openedAt = Date.now();
    }
  }

  getState(): State { return this.state; }
}

Usage with a real HTTP client:

const paymentBreaker = new CircuitBreaker(
  (orderId: string) => paymentClient.charge(orderId),
  {
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 30_000,      // 30 second sleep before probe
    volumeThreshold: 10,   // need at least 10 requests before tripping
  }
);

// In order handler
try {
  const receipt = await paymentBreaker.call(orderId);
  return { success: true, receipt };
} catch (err) {
  if (err.message.includes("CircuitOpenException")) {
    // Fast path: return a degraded response, don't wait
    return { success: false, reason: "Payment service temporarily unavailable" };
  }
  throw err;
}

This sketch uses consecutive-failure counting, not a sliding window

The failureCount above resets to zero on any success, meaning it tracks "failures since last success," not a true sliding window over the last N requests. Real libraries (Resilience4j COUNT_BASED, Polly's AdvancedCircuitBreaker) maintain a circular buffer of the last N outcomes and compute error rate across all of them. A mid-stream success doesn't reset the window. Build from a library; don't hand-roll the windowing logic.
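To make the difference concrete, here's a minimal count-based window in the spirit of those libraries' circular buffers; the class name and shape are illustrative, not copied from any library:

```typescript
// Sketch: count-based sliding window over the last N outcomes.
class CountWindow {
  private outcomes: boolean[] = []; // true = success; holds at most `size` entries

  constructor(private readonly size: number) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > this.size) this.outcomes.shift(); // oldest slides out
  }

  errorRate(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((ok) => !ok).length;
    return failures / this.outcomes.length; // a mid-stream success does NOT reset this
  }
}
```

A success in the middle of the buffer lowers the rate, but the earlier failures stay counted until they slide out of the window, which is exactly what the consecutive-failure sketch above gets wrong.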

Always handle CircuitOpenException separately

The error path from a circuit open state is structurally different from a downstream 500. A CircuitOpenException means you shouldn't retry; retrying would immediately fail again and adds no value. Catch it separately, log it as a known degraded state, and return a graceful fallback. Don't let it bubble up as a generic 503.


Fallback Strategies

A circuit breaker alone only stops the bleeding; it still leaves the user with an error. The real design question is: what do you return when the circuit is open?

| Fallback strategy | When to use | Trade-off |
|---|---|---|
| Stale cache | Read-heavy data that tolerates brief staleness (product catalog, prices, recommendations) | Users see slightly outdated data, usually acceptable |
| Degraded response | Feature is non-critical (trending list, recommendations, social counts) | Return empty/default instead of failing entirely |
| Queue for later | Writes that can be deferred (order confirmations, analytics events) | Adds latency to eventual processing; requires queue durability |
| Return error | Operations where staleness is unacceptable (payment auth, inventory check for purchase) | Honest failure: user must retry, but system remains stable |
| Static default | Configuration, feature flags | Return last known value or a safe hardcoded default |

My recommendation: design the fallback before you write the circuit breaker. The fallback decision drives your threshold configuration. A service backing critical writes needs a tighter threshold and a correct error return; a service backing recommendations can tolerate a looser threshold and a stale cache fallback.

Stale cache + circuit breaker: a natural combination

If your circuit breaker trips on the Recommendations Service, the best fallback is usually your application-level cache: return the last response you successfully got from that service. The cache key is already there; you just extend its TTL indefinitely during the OPEN state and serve it. Most users won't notice the recommendations are 5 minutes stale. They'll definitely notice a 500.
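A sketch of that combination, with the breaker-wrapped call passed in as a plain async function; the function names and the Map-based cache are illustrative stand-ins for your real breaker and application cache:

```typescript
// Sketch: stale-cache fallback around a breaker-wrapped call.
type RecsCall = (userId: string) => Promise<unknown>;

const lastGood = new Map<string, unknown>(); // userId -> last successful payload

async function withStaleFallback(call: RecsCall, userId: string): Promise<unknown> {
  try {
    const fresh = await call(userId);
    lastGood.set(userId, fresh); // refresh the fallback copy on every success
    return fresh;
  } catch {
    // Circuit open or downstream error: serve the stale copy if we have one.
    return lastGood.get(userId) ?? []; // degraded default: empty recommendations
  }
}
```

Note the ordering: the cache is written on every success, so by the time the circuit trips there is almost always a last-good copy to serve.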


Where to Place Circuit Breakers

Every outbound network call in every service deserves its own circuit breaker. Not one per service: one per call site, so a flaky database doesn't trip the breaker wrapping your auth service.

API Gateway connected to three services: Order, User, Payment. Each outbound call arrow has a CB (circuit breaker) guard label. Each service also has a CB guard on its database calls.
Place a circuit breaker on every outbound call, at both the service-to-service and service-to-database boundaries. A failing Orders DB should only affect Order Service features, not the entire mesh.

The granularity question comes up in interviews: should you have one circuit breaker per service or per endpoint?

The answer depends on the failure surface:

  • Per-service is sufficient if all endpoints on that service share the same database or infrastructure. One DB outage takes all endpoints down anyway.
  • Per-endpoint is better if different endpoints have different failure profiles (e.g., a POST /charge endpoint is fragile but GET /balance is fast and lightweight). You don't want a slow charge endpoint to trip the breaker on balance checks.
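One common shape for per-endpoint granularity is a registry keyed by call site. The sketch below uses a minimal stand-in `Breaker` class; in real code you'd substitute the `CircuitBreaker` sketch above or a library breaker:

```typescript
// Sketch: one circuit breaker per call site ("service:endpoint"), so a fragile
// POST /charge can trip without affecting GET /balance on the same service.
class Breaker {
  state: "CLOSED" | "OPEN" | "HALF_OPEN" = "CLOSED"; // minimal stand-in
}

const breakers = new Map<string, Breaker>();

function breakerFor(service: string, endpoint: string): Breaker {
  const key = `${service}:${endpoint}`; // one entry per call site, not per service
  let breaker = breakers.get(key);
  if (!breaker) {
    breaker = new Breaker();
    breakers.set(key, breaker);
  }
  return breaker;
}
```

Collapsing the key to just `service` gives you the per-service variant; the lookup code is otherwise identical, which makes it cheap to start per-service and split later.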

Placement in a Service Mesh

When you're running a service mesh like Istio or Linkerd, circuit breakers can also be configured at the infrastructure level, in the sidecar proxy rather than in your application code. This is worth knowing for interviews.

# Istio DestinationRule: circuit breaker at the proxy layer
apiVersion: networking.istio.io/v1  # v1alpha3/v1beta1 are deprecated as of Istio 1.22
kind: DestinationRule
metadata:
  name: payment-service-cb
spec:
  host: payment-service.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5          # trip on 5 consecutive 5xx
      interval: 10s                     # evaluated every 10 seconds
      baseEjectionTime: 30s             # eject for 30 seconds
      maxEjectionPercent: 100           # allow ejecting all instances if all bad

The upside: no library dependency, language-agnostic, and centrally observable. The downside: coarser granularity (per-service, not per-endpoint), and you don't control the fallback logic in the proxy; it just returns a 503. For rich fallback behavior (stale cache, queued writes), you still need application-level circuitry.


When It Shines

So when does this pattern actually matter in an interview? Bring it up any time you have synchronous service-to-service calls, which is most microservices designs.

Use circuit breakers when:

  • You have service-to-service calls where the downstream can be slow or fail (i.e., virtually always in microservices).
  • The downstream service's SLA is lower than your caller's SLA. If payments can be slow, orders must not be slow.
  • A downstream failure should not cascade; your system has non-failing features you need to protect.
  • You have fan-out calls, where one API request triggers 3+ downstream calls. One slow dependency drags all of them.
  • You're calling an external third-party API (Stripe, Twilio) with no SLA guarantee.

Skip (or simplify) when:

  • You have a single-service monolith with no remote calls. There's nothing to protect against.
  • The downstream call is already wrapped in an async queue with no synchronous wait. A Kafka consumer doesn't hold threads waiting for a database β€” it processes at its own rate and can dead-letter failed messages.
  • You're in a batch/offline pipeline. A nightly ETL job that fails is better retried at the job level, not circuit-broken at the call level.

The rule of thumb: any synchronous HTTP call across a service boundary needs a circuit breaker. Any call where one service failing shouldn't surface as a failure in a different (unrelated) service.


Failure Modes & Pitfalls

1. Threshold too sensitive: constant tripping

Setting failureThreshold: 1 means a single 500 trips the circuit. During a normal deploy with a ~5-second rolling restart, you'll flip services in and out of OPEN state every few minutes. Your fallback logic gets exercised constantly, users see degraded responses, and your circuit breaker logs become noise.

I often see this in teams that use default library configs without tuning them to their traffic profile. The fix: start conservative. Set the threshold to 5 failures out of at least 10 recent requests over a 30-second window. Tighten from there after observing real traffic patterns.

2. Half-open probe succeeds on a fluke: premature close

The service is still degraded, returning 200 on 20% of requests. Your single HALF-OPEN probe hits the lucky 20% and the circuit closes. Full traffic floods in. 80% of requests fail again. The circuit trips back to OPEN within seconds. Users experience a brief recovery window, then degradation returns.

The fix: require N consecutive successes in HALF-OPEN before closing, not just one. Setting successThreshold: 3 means three consecutive probes must all succeed before the circuit closes for real.

3. The pent-up demand re-trip

This one is subtle and doesn't appear in most articles. When a circuit has been OPEN for 30+ seconds, clients back off, but they don't stop incoming requests. Those requests queue at the gateway or retry. When the circuit finally closes, all that queued demand hits the recovering downstream service simultaneously.

The recovering service, not yet at full capacity, gets a spike of 10× normal traffic. It can't handle it. Error rate spikes. The circuit trips again within seconds of closing.

This is a form of thundering herd caused by the circuit breaker's own recovery event, not the original failure. Mitigations: gradual traffic ramping in HALF-OPEN (allow 1%, then 5%, then 25%, then 100%), or a short secondary sleep window between close and full traffic.
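One way to sketch gradual re-admission is probabilistic gating: each ramp step admits a growing fraction of calls, and everything else takes the fallback path. The schedule and names below are illustrative assumptions, not a library API:

```typescript
// Sketch: probabilistic gradual re-admission after a successful probe.
// Step 0 admits 1% of calls, step 1 admits 5%, and so on up to full traffic.
const RAMP_FRACTIONS = [0.01, 0.05, 0.25, 1.0];

function admit(rampStep: number, random: () => number = Math.random): boolean {
  const fraction = RAMP_FRACTIONS[Math.min(rampStep, RAMP_FRACTIONS.length - 1)];
  return random() < fraction; // rejected calls go straight to the fallback path
}
```

The breaker would advance `rampStep` after each error-free interval and drop back to OPEN if the error rate spikes at any step, so the recovering service never sees the full queued backlog at once.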

4. No fallback: just an error

A circuit breaker that trips and returns a raw exception to the user is better than an infinite timeout, but only marginally. The real value comes from pairing the circuit with a fallback strategy: stale cache, default value, or queued retry.

The mistake I see most often is teams adding circuit breakers as a defensive measure but never designing the fallback. When the circuit trips in production, users see a generic error page. The error is fast, but it's still an error.

5. Invisible state: no observability

If your circuit breaker trips and nobody knows, because there's no metric, no alert, and no dashboard, you'll discover it in customer complaints 20 minutes later rather than immediately from your monitoring. Circuit state must be a first-class metric.

At minimum:

  • Gauge: circuit_breaker_state_open{service="payment"}. Alert if any breaker is OPEN for > 60 seconds.
  • Histogram: circuit_breaker_call_duration_ms, with separate buckets for CLOSED vs OPEN vs HALF-OPEN.
  • Counter: circuit_breaker_trips_total{service="payment"}. Alert if trip rate exceeds baseline.
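A sketch of wiring those metrics to state transitions; the `Metrics` interface is a hypothetical stand-in for your real client (prom-client, StatsD, etc.):

```typescript
// Sketch: emit circuit state as first-class metrics on every state transition.
interface Metrics {
  gauge(name: string, value: number, tags: Record<string, string>): void;
  increment(name: string, tags: Record<string, string>): void;
}

type CbState = "CLOSED" | "OPEN" | "HALF_OPEN";

function reportTransition(m: Metrics, service: string, from: CbState, to: CbState): void {
  // 1 while OPEN, 0 otherwise: alert if this gauge stays at 1 for > 60s.
  m.gauge("circuit_breaker_state_open", to === "OPEN" ? 1 : 0, { service });
  if (from !== "OPEN" && to === "OPEN") {
    m.increment("circuit_breaker_trips_total", { service }); // count each trip
  }
}
```

Calling this from the two places the sketch sets `this.state = "OPEN"` (and from the close path) is enough to drive both alerts listed above.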

Trade-offs

| Pros | Cons |
|---|---|
| Prevents cascading failures: one broken dependency doesn't bring down the mesh | Adds complexity: every call site needs correct configuration and fallback logic |
| Fail-fast returns < 1ms vs 30s timeout, freeing threads immediately | Threshold misconfiguration causes either too-frequent trips (noisy) or too-slow trips (not protective) |
| Explicit degraded state: your service knows which dependency is failing | False trips during rolling deployments or brief network hiccups cause unnecessary degradation |
| Forces you to design fallbacks: the pattern demands you answer "what if this service is down?" | Half-open probe semantics vary by library; behavior can surprise you in production |
| Observable: circuit state is a clear operational health signal | Doesn't solve the root cause; the downstream service is still broken |
| Works at the call-site level: one bad endpoint doesn't kill a healthy endpoint | Recovery stampede is a real operational risk if not handled |

The fundamental tension here is fast failure vs transparent failure. A circuit breaker makes failures faster and contained, but it actively hides them from certain layers of your system. Your Order Service appears to function (it's returning fast fallback responses) while the Payment Service is completely broken. This is the right trade-off for resilience, but it means you need excellent observability to know your system's actual health, not just its response rates.


Real-World Usage

Netflix: Hystrix and the origin story

Netflix is where the circuit breaker pattern became mainstream in microservices. When they decomposed their monolith into hundreds of services (2010–2012), cascading failures became their #1 operational challenge. A slow recommendation model or a slow A/B testing service would stall video playback requests that fetched dozens of enrichments simultaneously.

Their solution, Hystrix, added circuit breakers, bulkheads, and fallbacks to every inter-service call. The key decision: every Hystrix-wrapped call ran in its own dedicated thread pool, so one service's circuit consuming 100 threads couldn't affect another service's thread pool at all. Netflix published that Hystrix shed billions of thread executions per day at peak, with the majority in fallback state. The lesson: in a service mesh, fallback is not a rare edge case; it's a constant operational state.

Hystrix is now maintenance-only. Its successor is Resilience4j, which is lighter-weight and doesn't require a dedicated thread pool per command; it uses decorators on existing futures instead.

Stripe: silent circuit breakers on external API calls

Stripe's reliability engineering wraps calls to external banking partners (card networks, bank APIs) in circuit breakers with stale-data fallbacks, a pattern consistent with their public engineering talks on payment resiliency. When a card network's API becomes slow, the approach avoids returning hard payment failures: instead, the last-known authorization state from a short-lived cache is returned, while monitoring alerts on the circuit state. The user sees the transaction as pending rather than failed. The key insight: for payment-adjacent reads, a brief stale response is almost always less damaging than a visible failure.

Amazon: "blast radius" containment

Amazon's engineering culture around availability, documented extensively in Werner Vogels's writings and re:Invent talks, demands that every service's SLA chain be arithmetically satisfiable without manual coordination. A practical consequence of this is anchoring circuit breaker timeouts to SLA arithmetic: if the Order Service has a 500ms end-to-end SLA and spends 120ms on everything other than the Inventory call, any Inventory call held longer than ~380ms is already SLA-violating and should be cut. Whether Amazon encodes this formula literally as policy isn't public, but the math is sound and directly applicable to your own service design.
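The arithmetic is simple enough to encode directly; this sketch (with illustrative numbers, not an Amazon API) computes the timeout budget left for a downstream call:

```typescript
// Sketch: a downstream call's timeout budget is the caller's end-to-end SLA
// minus the time everything else in the request path consumes.
function timeoutBudgetMs(callerSlaMs: number, otherWorkMs: number): number {
  return callerSlaMs - otherWorkMs; // holding a call past this already violates the SLA
}
```

With a 500ms caller SLA and 120ms of other work, the budget is 380ms, matching the ~380ms cut-off above; that budget is what the call timeout (and therefore the breaker's failure detection) should be anchored to.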


How This Shows Up in Interviews

Here's the honest answer: every microservices design question should have circuit breakers in it, but almost nobody draws them unless prompted. The interviewer is testing whether you know that synchronous service calls are inherently fragile, not just whether you can name the pattern.

My recommendation: as soon as you draw a service-to-service call in any design, add a small "CB" label on the arrow and say: "I'll add circuit breakers on all outbound calls here. In OPEN state, callers get immediate errors and fall back to stale cache or a degraded response." One sentence. Then move on. Don't spend five minutes explaining the state machine unless you're asked to go deeper.

When to bring it up proactively

Draw circuit breakers when you draw microservice architectures, external API integrations (payment providers, SMS gateways), or any system with synchronous fan-out (one request → multiple downstream calls). Say: "I'll circuit-break every call here so one service failure doesn't cascade. During OPEN state, the Order Service falls back to [specific fallback], which limits blast radius to only the features that depend on [that service]." Naming the blast radius is the signal that you understand what you're protecting.

Depth expected at senior/staff level, and what separates a surface answer from a strong one:

Don't just name the pattern: describe the threshold and fallback

Saying "I'd add a circuit breaker here" without specifying the failure threshold, sleep window, and fallback strategy signals you've memorised a pattern name without understanding its mechanics. The follow-up question "what's your failure threshold and why?" is guaranteed. Say: "5 failures in the last 30-second window, sleep for 30 seconds, require 2 consecutive probe successes before closing. Fallback: serve last-known value from the application cache."

Specifically, strong candidates can:

  • Explain all three states precisely: CLOSED → OPEN → HALF-OPEN → CLOSED/OPEN, with transition conditions.
  • Distinguish count-based vs time-based sliding windows and when each is appropriate.
  • Describe the minimum volume guard, and why you shouldn't trip on 5 failures if the total request count is 5.
  • Name the pent-up demand problem and how to mitigate it (gradual re-admission in HALF-OPEN, or permittedNumberOfCallsInHalfOpenState > 1 in Resilience4j).
  • Compare application-level CBs (Resilience4j, Polly) vs infrastructure-level (Istio outlierDetection), and why you might want both.
  • Describe what circuit state observability looks like: circuit state as a time-series metric, alert on sustained OPEN, separate latency histogram per state.

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
|---|---|
| "What's your failure threshold and how do you tune it?" | "Start at 5 failures in 30 seconds with a minimum volume of 10 requests. This prevents cold-start noise from tripping the circuit. Then observe real traffic: if trips are frequent during normal deploys, raise the threshold. If the circuit takes too long to trip during real outages, lower it. Always anchor the threshold to observed p99 latency and error rate distributions in prod." |
| "What happens to requests during the OPEN state?" | "They get an immediate CircuitOpenException, sub-millisecond. The caller should catch this separately from a downstream 500, not retry it, and execute the fallback path. Retrying a CircuitOpenException is useless and wastes CPU." |
| "How does Istio's outlierDetection differ from Resilience4j?" | "Istio implements outlier detection at the proxy layer: language-agnostic and centrally configured, but with coarser granularity (per-instance ejection, not per-endpoint trip) and no application-layer fallback logic. Resilience4j lives in your code and gives you full control over fallbacks and retry integration. For polyglot service meshes I'd want both: Istio for infrastructure-level protection plus Resilience4j for application-aware fallbacks." |
| "What's the difference between a circuit breaker and a retry?" | "Retries are for transient failures: a 500 that'll likely succeed on attempt 2. Circuit breakers are for sustained failures, when the downstream service is measurably broken and retry is actively harmful (every retry wastes a thread during the timeout). They're complementary: retry first, but if the error rate is high enough to trip the circuit, stop retrying entirely and fail fast." |
| "A service has occasional spikes to 5xx under high load but recovers in 2 seconds. How do you configure the circuit breaker to not trip on load spikes?" | "Increase the minimum volume threshold to 20 requests and expand the window to 60 seconds. The threshold counts should represent a sustained failure, not a burst. Alternatively, use a time-based window with a 5% error rate threshold rather than a count: at high traffic, 5% of 1000 requests is 50 errors, which is genuinely bad. At low traffic, 5% of 10 requests is 0.5 errors, which is noise." |

Know these answers cold: circuit breaker state machines, threshold configuration, and the distinction from retry all come up in virtually every senior microservices interview.


Variants

Count-based vs time-based windows

Count-based: Trip when X of the last N requests fail. Simple and predictable. Works well for services with steady traffic. The risk: at low traffic (10 requests/min), the window might span 5 minutes, so you're measuring failures from 5 minutes ago.

Time-based: Trip when X failures occur within the last T seconds, or when the error rate exceeds a threshold over the last T seconds. More responsive to time-bounded incidents. Requires careful calibration: a 10-second window at 10K req/s has far more data than a 10-second window at 10 req/min.

Resilience4j supports both through its SlidingWindowType.COUNT_BASED and SlidingWindowType.TIME_BASED configuration.

Per-instance vs aggregate

In a multi-instance deployment, do you trip each instance's circuit independently, or aggregate failure counts across all instances?

Per-instance (default): Each instance has independent state. One instance observing a bad upstream connection trips its local circuit. Other instances keep trying. This is more resilient to false trips but means different callers see different behavior.

Aggregate: Centralized circuit state (e.g., stored in Redis). When enough instances report failures, the circuit trips globally. Consistent behavior across the fleet, but requires distributed coordination and adds Redis as a dependency in the failure path.

Most implementations use per-instance by default. For extreme consistency requirements, centralized state is warranted — but it's complex, and I'd advise against it unless per-instance behavior has caused measurable problems in production.
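The aggregate variant can be sketched with a plain dict standing in for the shared store. In production this would be Redis with atomic increments and TTLs; here everything — the store, the function names, the thresholds — is illustrative:

```python
# Shared store stand-in: instance_id -> recent failure count that instance reported.
shared_store: dict[str, int] = {}

TRIP_WHEN_INSTANCES_FAILING = 3   # trip globally once this many instances report trouble
PER_INSTANCE_FAILURE_FLOOR = 5    # an instance "reports trouble" at or above this count


def report_failures(instance_id: str, failures: int) -> None:
    """Each instance periodically publishes its local failure count."""
    shared_store[instance_id] = failures


def circuit_is_open() -> bool:
    """Global trip decision: enough instances independently seeing failures."""
    failing = sum(1 for f in shared_store.values() if f >= PER_INSTANCE_FAILURE_FLOOR)
    return failing >= TRIP_WHEN_INSTANCES_FAILING
```

The two-threshold shape matters: requiring multiple instances to agree filters out the one-instance-with-a-bad-connection case that falsely trips per-instance circuits — at the cost of putting the shared store on the failure path.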

Client-side vs server-side

A circuit breaker can live in:

  1. The caller (client-side): the calling service wraps its own outbound calls. Standard approach β€” the caller controls the fallback logic.
  2. The infrastructure (service mesh): the sidecar proxy monitors calls and ejects unhealthy upstream instances. No fallback logic, but language-agnostic.
  3. The called service (server-side rate limiting + circuit): the downstream service limits incoming calls via rate limiting + queue depth. Not quite the same pattern, but solves a similar stability problem.

In practice, client-side + service mesh is a complementary combination, not a choice between them.
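As a concrete example of the infrastructure side of that combination, Istio's outlier detection is configured on a DestinationRule. The host name and numeric values below are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service          # illustrative upstream host
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # eject an instance after 5 consecutive 5xx
      interval: 10s              # how often the proxy scans for outliers
      baseEjectionTime: 30s      # how long an ejected instance stays out
      maxEjectionPercent: 50     # never eject more than half the pool
```

Note `maxEjectionPercent`: it caps how much of the upstream pool the mesh can remove, so a systemic failure (where every instance looks bad) doesn't eject the entire fleet and leave the caller with nothing.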



Quick Recap

  1. A circuit breaker wraps outbound calls and monitors failure rate β€” when failures exceed a threshold, it trips to OPEN state and returns instant errors instead of waiting for timeouts to exhaust your thread pool.
  2. Three states drive everything: CLOSED (normal), OPEN (fail-fast, sub-millisecond errors), and HALF-OPEN (one probe request to test recovery before reopening fully).
  3. The sliding window discards old failure events β€” only failures within the configured window count, which prevents a bad minute from tripping the circuit hours later when the service has recovered.
  4. Always configure a minimum volume threshold so the circuit doesn't trip on 5 failures if you've only served 5 requests β€” cold-start noise and rolling deploy windows will false-trip an untested circuit breaker.
  5. The most dangerous operational failure mode is the pent-up demand re-trip: when the circuit closes after a long OPEN period, queued retries flood in simultaneously and can re-trip the breaker before the recovering service stabilizes.
  6. Application-level circuit breakers (Resilience4j, Polly) give you fallback logic and endpoint-level granularity. Service mesh circuit breakers (Istio outlierDetection) are language-agnostic and centrally observable. Use both in production.
  7. Circuit state is a first-class operational metric β€” alert on any breaker in OPEN state for > 60 seconds; treat it as a production incident, not background noise.

Related Concepts

  • Bulkhead Pattern β€” The circuit breaker's sister pattern. While the circuit breaker stops calls when a service fails, bulkheads isolate thread pools so one service's failure can't consume resources reserved for other services. The two are almost always used together.
  • Rate Limiting β€” Rate limiting protects services from overload at the ingress; circuit breakers protect callers from downstream failures at the egress. Same resilience goal, different position in the call chain.
  • Microservices β€” The context in which circuit breakers become critical. Synchronous service-to-service calls are the failure surface that circuit breakers protect. Every microservices design should have them.
  • Service Mesh β€” Istio and Linkerd implement outlier detection (infrastructure-level circuit breaking) at the sidecar proxy layer, complementing application-level circuit breakers.
  • Caching β€” The most common fallback strategy for a tripped circuit breaker. When the circuit opens on a read-heavy service, serving stale cached data keeps users functional while the downstream service recovers.
