Retry with backoff and jitter
How to implement retries safely: exponential backoff, full vs. equal jitter, retry budgets, idempotency requirements, and when retrying makes things worse instead of better.
TL;DR
- Naive retries (immediate, uniform, unlimited) amplify failures into thundering herds that can take down a recovering service.
- Exponential backoff (`base * 2^attempt`) spaces retries out, but without jitter, all clients still retry in lockstep.
- Jitter (randomizing the backoff delay) is the critical ingredient that desynchronizes clients and prevents coordinated retry storms.
- Always pair retries with a budget (max attempts, max duration) and a circuit breaker to stop retrying when the downstream is genuinely broken.
- Retries are only safe on idempotent operations. Retrying a non-idempotent write without an idempotency key creates duplicates.
The Problem
Your payment service calls a banking partner's API. The partner has a brief network hiccup lasting 2 seconds. Your service has 500 concurrent requests in flight. All 500 get a connection timeout.
Without retries, all 500 users see a payment failure. That's bad. So you add retries.
All 500 clients retry immediately. The partner's API was recovering from a 2-second blip, handling its normal 200 requests/second. Now it gets hit with 500 retries on top of the 200 new requests. 700 requests/second when it can handle 200. The API goes down again. All 700 requests fail. All 700 retry simultaneously. The partner API never recovers because every recovery attempt triggers a new retry wave.
This is the thundering herd problem. Synchronized retries convert a brief hiccup into a sustained outage. The retries that were supposed to help are now the primary cause of the failure.
The forces in tension: you need retries for resilience (transient failures are common in distributed systems), but naive retries amplify failures instead of absorbing them. Backoff and jitter are the mechanism that converts retries from a liability into an asset.
One-Line Definition
Retry with backoff absorbs transient failures by progressively increasing wait times between attempts and adding randomized jitter to prevent synchronized retry storms.
Analogy
You're at a busy coffee shop. You walk up to the counter, but the barista is dealing with a spilled drink. You wait 10 seconds and try again. Still busy. You wait 20 seconds. Still busy. You wait 40 seconds. This time, there's a gap and you get served.
Now imagine 50 people all tried the counter at the same moment, all got turned away, and all came back exactly 10 seconds later. The counter is instantly overwhelmed again. But if each person waited a random amount between 5 and 40 seconds, they'd trickle back in small groups and the barista could handle them. That's jitter.
Solution Walkthrough
The solution has three layers, each building on the last: backoff (space out retries), jitter (desynchronize clients), and budgets (limit total retry impact).
Layer 1: Exponential Backoff
Instead of retrying immediately, increase the wait time exponentially:
wait = min(cap, base * 2^attempt)
Example with base=100ms, cap=30s:
| Attempt | Wait | Cumulative delay |
|---|---|---|
| 1 | 100ms | 100ms |
| 2 | 200ms | 300ms |
| 3 | 400ms | 700ms |
| 4 | 800ms | 1.5s |
| 5 | 1,600ms | 3.1s |
| 6 | 3,200ms | 6.3s |
| 7 | 6,400ms | 12.7s |
| 8+ | 12.8s, 25.6s, then 30s (capped) | varies |
The cap prevents waits from growing indefinitely. Without a cap, attempt 15 would wait about 27 minutes (100ms * 2^14). In practice, cap at 30-60 seconds for synchronous calls and 5-15 minutes for background jobs.
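The schedule above can be reproduced in a few lines. This is a sketch; `backoffSchedule` is a hypothetical helper that uses the table's convention (attempt 1 maps to exponent 0):

```typescript
// Compute the capped exponential backoff wait for each attempt.
// Table convention: attempt 1 waits base * 2^0, attempt 2 waits base * 2^1, etc.
function backoffSchedule(baseMs: number, capMs: number, attempts: number): number[] {
  const waits: number[] = [];
  for (let attempt = 1; attempt <= attempts; attempt++) {
    waits.push(Math.min(capMs, baseMs * 2 ** (attempt - 1)));
  }
  return waits;
}

// base=100ms, cap=30s: 100, 200, 400, ... the cap kicks in at attempt 10.
const schedule = backoffSchedule(100, 30_000, 12);
```

Printing the schedule makes the cap's effect obvious: doubling stops dead at 30,000ms rather than climbing into minutes.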
But backoff alone isn't enough. If 500 clients all hit a failure at T=0, they all wait 100ms, all retry at T=100ms, all wait 200ms, all retry at T=300ms. The retries are still synchronized. They're just synchronized at exponentially increasing intervals.
Layer 2: Jitter
Jitter adds randomness to break the synchronization. There are three common strategies:
Full jitter (AWS recommendation):
sleep = random_between(0, min(cap, base * 2^attempt))
The sleep time can be anywhere from 0 to the full backoff value. This gives the widest spread. Most clients wait somewhere in the middle. Some retry quickly (lucky), some wait longer (unlucky), and the load spreads out.
Equal jitter:
v = min(cap, base * 2^attempt)
sleep = v/2 + random_between(0, v/2)
Guarantees at least v/2 wait time. The random component only affects the upper half. This prevents the "lucky" zero-wait retries that full jitter allows.
Decorrelated jitter:
sleep = min(cap, random_between(base, previous_sleep * 3))
Each sleep is based on the previous sleep, not the attempt number. Self-correcting: if a sleep was short, the next one trends longer. No exponential formula needed.
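The three strategies can be sketched as small functions. This is a sketch under stated assumptions: `Math.random` is the randomness source, and `prevSleepMs` threads the decorrelated strategy's state between calls:

```typescript
// Plain capped exponential backoff: the value the jitter strategies randomize.
const expBackoff = (baseMs: number, capMs: number, attempt: number) =>
  Math.min(capMs, baseMs * 2 ** attempt);

// Full jitter: anywhere in [0, backoff). Widest spread, allows lucky zero-waits.
function fullJitter(baseMs: number, capMs: number, attempt: number): number {
  return Math.random() * expBackoff(baseMs, capMs, attempt);
}

// Equal jitter: guarantees at least half the backoff; randomness only in the upper half.
function equalJitter(baseMs: number, capMs: number, attempt: number): number {
  const v = expBackoff(baseMs, capMs, attempt);
  return v / 2 + Math.random() * (v / 2);
}

// Decorrelated jitter: next sleep depends on the previous sleep, not the attempt number.
function decorrelatedJitter(baseMs: number, capMs: number, prevSleepMs: number): number {
  return Math.min(capMs, baseMs + Math.random() * (prevSleepMs * 3 - baseMs));
}
```

Usage-wise the first two are stateless per attempt, while decorrelated jitter needs the caller to feed each result back in as `prevSleepMs` (seeded with `baseMs`).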
Jitter comparison
| Strategy | Min wait | Max wait | Spread quality | Complexity |
|---|---|---|---|---|
| Full jitter | 0 | backoff | Best | Low |
| Equal jitter | backoff/2 | backoff | Good | Low |
| Decorrelated jitter | base | 3x previous | Good | Medium |
| No jitter | backoff | backoff | None (all synchronized) | Lowest |
Full jitter gives the best load spreading. Equal jitter is safer when you need a guaranteed minimum wait. In practice, either works well. The important thing is that different clients desync from each other. I recommend full jitter as the default.
Layer 3: Retry Budgets
Unlimited retries compound problems. A retry budget limits the blast radius.
Max attempts: Typically 3-5 for synchronous calls, 5-10 for async/background jobs. After the budget is exhausted, fail the request (return an error) or route to a dead-letter queue.
Max retry duration: Total time including all waits. A request with 5 retries and exponential backoff should not exceed 60 seconds total. This prevents a single request from holding a thread for minutes.
Retry rate limit: A global cap on how many retries per second across all clients. Requires coordination via a shared counter or token bucket. This is the strongest guarantee (limits amplification factor), but requires infrastructure.
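A per-process version of that rate limit can be sketched as a token bucket. This is a local approximation only; the fleet-wide guarantee described above needs a shared counter or token store, which is not shown:

```typescript
// Token bucket gating retries: refills `ratePerSec` tokens per second up to `burst`.
// A retry is allowed only if a whole token is available; otherwise fail fast.
class RetryBudget {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private ratePerSec: number, private burst: number) {
    this.tokens = burst;
  }

  tryAcquire(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, never exceeding the burst size.
    this.tokens = Math.min(
      this.burst,
      this.tokens + ((now - this.lastRefill) / 1000) * this.ratePerSec
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // retry allowed
    }
    return false; // budget exhausted: surface the error instead of retrying
  }
}
```

The caller checks `tryAcquire()` before each retry (not the first attempt), so a burst of failures drains the bucket and subsequent failures fail fast.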
Retry amplification factor: In a 3-tier system (frontend → service A → service B), if each caller retries 3 times, a single user request can generate 3 x 3 = 9 requests to service B. Add one more tier and it's 3 x 3 x 3 = 27 requests to the innermost service. This multiplicative amplification is why retry budgets matter for deep call chains.
For deep call chains, reduce retry count at each layer. Frontend: 3 retries. Service A: 2 retries. Service B: 1 retry. This keeps the amplification factor manageable (3 x 2 x 1 = 6 instead of 3 x 3 x 3 = 27).
Combining with circuit breakers
Retries and circuit breakers work at different time scales. Retries handle brief, transient failures (a single request that times out). Circuit breakers handle sustained failures (a service that's been returning errors for 30+ seconds).
The flow: retry logic fires first. If the request fails, backoff and retry up to the budget. If the error rate across multiple requests exceeds the circuit breaker threshold, the circuit opens and all subsequent retry attempts are short-circuited (immediate failure, no remote call). This prevents retries from pounding a service that's already proven to be broken.
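The open-circuit check can be sketched as a failure-rate gate consulted before every attempt. This is a minimal sketch: `windowSize` and `threshold` are illustrative parameters, and real breakers also add a half-open probe state for recovery, omitted here:

```typescript
// Minimal circuit-breaker gate: tracks the last N outcomes and rejects
// calls when the observed failure rate reaches the threshold.
class BreakerGate {
  private outcomes: boolean[] = []; // true = success, false = failure

  constructor(private windowSize = 20, private threshold = 0.5) {}

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  allowRequest(): boolean {
    if (this.outcomes.length < this.windowSize) return true; // not enough data yet
    const failures = this.outcomes.filter((ok) => !ok).length;
    return failures / this.outcomes.length < this.threshold;
  }
}

// Retry loop shape: check the gate before EVERY attempt, including retries.
// if (!gate.allowRequest()) throw new Error("circuit open: failing fast");
```

Because the gate is checked per attempt, an open circuit short-circuits not just new requests but also the remaining retries of in-flight ones.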
Implementation Sketch
Here's a retry function with exponential backoff and full jitter; error classification is injected so the caller decides what counts as retryable:
```typescript
// Retry with exponential backoff + full jitter
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  options: {
    maxAttempts: number; // typically 3-5
    baseMs: number; // starting delay (100-500ms)
    capMs: number; // max delay (e.g. 30_000ms)
    isRetryable: (err: Error) => boolean;
  }
): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;
      if (!options.isRetryable(lastError)) {
        throw lastError; // non-retryable: 400, 404, validation error
      }
      if (attempt < options.maxAttempts - 1) {
        const backoff = Math.min(options.capMs, options.baseMs * 2 ** attempt);
        const jitter = Math.random() * backoff; // full jitter: [0, backoff)
        await sleep(jitter);
      }
    }
  }
  throw lastError!; // all attempts exhausted
}
```
```typescript
// Usage with retryable error classification
const result = await retryWithBackoff(() => paymentApi.charge(orderId, amount), {
  maxAttempts: 3,
  baseMs: 200,
  capMs: 10_000,
  isRetryable: (err) => {
    const status = (err as HttpError).statusCode;
    if (status === undefined) return true; // no status: network error/timeout
    return status === 429 || status >= 500; // only retry 429 + 5xx
  },
});
```
Key details: the isRetryable function is critical. It prevents retrying 400s and 404s (which will never succeed) and only retries transient errors (429, 5xx, network timeouts). Without this classification, you waste retry budget on permanent failures.
When NOT to Retry
This is where most implementations go wrong. Retrying the wrong thing is worse than not retrying at all.
Non-idempotent operations without idempotency keys: A POST to /payments that charges a card should not retry without an idempotency key. A duplicate request creates two charges. Before retrying any write operation, confirm it's idempotent or add an idempotency key.
4xx client errors: A 400 (bad request), 401 (unauthorized), 403 (forbidden), or 404 (not found) will not succeed on retry. The request is inherently wrong. Only retry on:
| Status Code | Meaning | Retry? |
|---|---|---|
| 400 | Bad request | No |
| 401 | Unauthorized | No (unless token refresh is retried) |
| 403 | Forbidden | No |
| 404 | Not found | No |
| 409 | Conflict | Maybe (if you can resolve the conflict) |
| 429 | Rate limited | Yes (respect Retry-After header) |
| 500 | Internal server error | Yes |
| 502 | Bad gateway | Yes |
| 503 | Service unavailable | Yes |
| 504 | Gateway timeout | Yes |
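The table maps directly onto a small classifier. This sketch treats a missing status code as a network-level failure (timeout, connection reset), which is usually retryable, and treats 409 conservatively as non-retryable since resolving a conflict needs application logic:

```typescript
// Decide whether an HTTP failure is worth retrying, per the table above.
// `status` is undefined for network-level failures (timeouts, resets).
function isRetryableStatus(status: number | undefined): boolean {
  if (status === undefined) return true; // network error / timeout: transient
  if (status === 429) return true;      // rate limited: retry after backoff
  if (status >= 500) return true;       // 500/502/503/504: server-side, transient
  return false;                         // other 4xx: the request itself is wrong
}
```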
When downstream is already overwhelmed: If the failure is due to overload, retrying adds fuel to the fire. Pair retries with a circuit breaker that stops retrying when the failure rate exceeds a threshold. The circuit breaker says "stop trying, it's broken" while the retry says "try again, it might work."
On non-transient errors in message consumers: If a Kafka consumer gets a deserialization error, retrying the same message with the same consumer code will fail identically every time. Route to a dead-letter queue instead.
The Retry-After header
When a server returns 429 (rate limited) or 503 (service unavailable), it often includes a Retry-After header specifying how long to wait. Respect this header. It's the server telling you exactly when to retry.
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 5
```
Your retry logic should check for this header and use it instead of the calculated backoff when present. The server knows its own capacity better than your exponential formula does. If the header says "retry after 60 seconds," don't retry in 2 seconds.
Most HTTP client libraries don't handle Retry-After automatically. You need to check for it explicitly in your retry logic.
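Parsing the header yourself is straightforward but has two valid formats, delta-seconds and an HTTP-date. A sketch (hypothetical `retryAfterMs` helper) that falls back to the computed backoff when the header is absent or malformed:

```typescript
// Retry-After can be delta-seconds ("5") or an HTTP-date
// ("Wed, 21 Oct 2026 07:28:00 GMT"). Returns a wait in milliseconds,
// or the fallback backoff when the header is missing or unparseable.
function retryAfterMs(header: string | null, fallbackMs: number): number {
  if (header === null) return fallbackMs;
  const seconds = Number(header);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  const dateMs = Date.parse(header); // NaN if not a valid HTTP-date
  if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - Date.now());
  return fallbackMs;
}
```

In the retry loop, this replaces the jittered backoff for 429/503 responses that carry the header: `await sleep(retryAfterMs(res.headers.get("retry-after"), backoff))`.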
Idempotency Keys
For non-idempotent operations you still need to retry (payment APIs, order creation), send an idempotency key with the request:
```http
POST /payments
Idempotency-Key: client-generated-uuid-here

{ "amount": 100, "currency": "USD" }
```
The server stores the result of the first successful processing keyed by that UUID. Any retry with the same UUID returns the stored result without reprocessing. This makes the endpoint safe to retry.
The key must be generated before the first attempt and reused on all retries. Don't generate a new key per attempt. Store the key alongside the request so that if your client crashes and restarts, the same key is used on the next attempt.
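The key lifecycle can be sketched like this. Assumptions: `crypto.randomUUID` generates the key, and the hypothetical `send` callback stands in for the real HTTP call (and for persisting the key, which real code should do before attempt 1):

```typescript
import { randomUUID } from "node:crypto";

// Generate the key ONCE, before the first attempt, and reuse it on every
// retry; the server dedupes on the key, so a retried charge applies once.
async function postWithIdempotency<T>(
  send: (idempotencyKey: string) => Promise<T>, // stands in for the real HTTP call
  maxAttempts: number
): Promise<T> {
  const key = randomUUID(); // in real code, persist this alongside the request
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await send(key); // same key on every attempt
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

The important property is that `key` is created outside the loop: moving `randomUUID()` inside the loop would silently defeat the server-side deduplication.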
Stripe, Square, and most payment APIs support idempotency keys as a first-class feature. If you're building an API that accepts retries, implement idempotency key support. Without it, you're forcing every client to implement their own deduplication, which they won't do consistently.
When It Shines
- Any synchronous service-to-service call where transient failures are common (network blips, temporary load spikes, rolling deployments).
- API gateway retries that absorb brief downstream hiccups before they reach the user.
- Database connection retries during failovers: the primary goes down, connections fail, the replica promotes in 5-30 seconds, and retries pick up the new primary.
- Message consumer retries for transient downstream failures before routing to a DLQ.
- Client-side retries in mobile/web apps where network connectivity is unreliable.
- Batch job retries where individual item failures shouldn't abort the entire batch.
The pattern works best when failures are brief and the downstream service is healthy most of the time. If the downstream is chronically unhealthy, retries just add load. Use a circuit breaker to detect chronic failure and stop retrying entirely.
For your interview: mention retries as a first-class concern whenever you draw synchronous service calls. The interviewer is looking for you to think about failure modes, not just the happy path.
Failure Modes & Pitfalls
1. Retry storms in deep call chains: In a 4-tier architecture, if every caller retries 3 times, a single request generates 3 x 3 x 3 = 27 calls to the innermost service. Set decreasing retry counts as you go deeper: frontend=3, middleware=2, backend=1. Or use a global retry budget across the call chain.
2. Retrying non-idempotent operations: Retrying a payment charge without an idempotency key creates duplicate charges. This is the most expensive retry bug you can have. Classify every endpoint as idempotent or non-idempotent before adding retry logic.
3. No jitter: Backoff without jitter synchronizes all retries. Amazon's famous "Exponential Backoff and Jitter" blog post showed that full jitter reduces total work by 4-5x compared to exponential backoff alone. Always add jitter.
4. Retrying on 4xx errors: A 400 Bad Request will fail identically on every retry. You're burning retry budget on a permanent error. Classify errors and only retry transient failures.
5. No backoff cap: Without a cap, wait times grow unbounded. Attempt 20 with base=100ms: wait = 100ms * 2^19 = 52,428,800ms, roughly 14.6 hours. Always cap at a reasonable maximum (30s for synchronous, 15 minutes for background).
6. Retrying at every layer independently: In microservices, if the API gateway, the orchestration service, and the downstream service all have independent retry policies, a single failure cascades into dozens of requests. Coordinate retry policies across layers. The outermost layer should have the most retries; inner layers should have fewer or none.
Trade-offs
| Pros | Cons |
|---|---|
| Absorbs transient failures transparently | Increases end-to-end latency (each retry adds delay) |
| Jitter prevents thundering herd on recovery | Retry amplification in deep call chains can 10x+ load |
| Simple to implement (20 lines of code) | Requires idempotency for safe write retries |
| Works at every layer (client, gateway, service, consumer) | Retrying an overloaded service adds load |
| Pairs well with circuit breakers and DLQs | Jitter introduces non-determinism (harder to test) |
| Exponential backoff naturally adapts to failure duration | Max retry duration can exceed user-facing SLAs |
The fundamental tension is resilience vs amplification. Retries make your system tolerant of transient failures, but each retry is additional load on an already-stressed downstream. Jitter and budgets limit the amplification, but they can't eliminate it entirely. If the downstream is genuinely broken (not just briefly hiccupping), retries only accelerate the failure.
Real-World Usage
AWS SDK implements retries with full jitter as the default for all API calls. The SDK retries on throttling (429), server errors (500/502/503), and transient network errors with a base of 25ms and a cap of 20 seconds. AWS published the foundational blog post "Exponential Backoff and Jitter" that demonstrated full jitter reduces total client work by 4-5x compared to equal jitter and 10x compared to no jitter. Their SDK's retry behavior is configurable per-client.
gRPC has a built-in retry policy specification. You configure retries in the service config:
```json
{
  "methodConfig": [{
    "name": [{ "service": "payments.PaymentService" }],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "10s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
    }
  }]
}
```
gRPC also supports "hedged requests" (send the same request to multiple backends simultaneously and take the first response), which is a more aggressive alternative to sequential retries for latency-sensitive paths. Hedging is powerful for read operations but dangerous for writes since multiple backends might all successfully process the request.
Stripe's idempotency keys are the gold standard for safe retries in payment systems. Every Stripe API call can include an Idempotency-Key header. Stripe stores the response for 24 hours and returns the cached result for any duplicate key. This makes it safe to retry charge requests, even for non-idempotent POST endpoints. Their API client libraries include automatic retry with exponential backoff and jitter, using the same idempotency key across all attempts. Stripe retries up to 2 times by default, using a base of 0.5 seconds, and only on network errors and 5xx responses, never on 4xx.
Retry in the wild: real defaults
AWS SDK: base=25ms, cap=20s, max 3 attempts, full jitter. Stripe: base=500ms, max 2 retries, same idempotency key. gRPC: configurable per-method, typically 3-4 attempts with backoff multiplier of 2. Spring Retry: configurable, commonly base=1s with multiplier=2 and max=5. These defaults converge on the same idea: short base, exponential growth, 3-5 attempts max, jitter.
How This Shows Up in Interviews
Retries come up in almost every system design interview, usually as a follow-up to "what happens when this service call fails?" The interviewer is testing whether you understand the dangers of naive retries, not just the happy path.
My recommendation: whenever you draw a synchronous service-to-service call, say: "I'll add retry with exponential backoff and jitter here, 3 attempts max, only on 5xx and timeouts. For payment calls, I'll use idempotency keys to make retries safe." That covers 90% of what the interviewer wants to hear.
The jitter signal
Most candidates mention "exponential backoff." Strong candidates mention jitter. If you explain why jitter matters (desynchronizing clients to prevent thundering herd), you've demonstrated understanding of the distributed systems failure mode, not just the retry pattern.
Depth expected at senior/staff level:
- Explain the thundering herd problem and why jitter solves it.
- Know the difference between full jitter, equal jitter, and decorrelated jitter.
- Calculate retry amplification in a multi-tier call chain (three retrying layers x 3 attempts each = 27x amplification).
- Explain when to retry (transient 5xx) vs when not to (4xx, non-idempotent writes without keys).
- Describe how retries interact with circuit breakers (retry first, circuit breaks when failure rate is sustained).
- Know about idempotency keys and how they make non-idempotent operations safe to retry.
Common follow-up questions and strong answers:
| Interviewer asks | Strong answer |
|---|---|
| "How do you prevent retry storms?" | "Three mechanisms: (1) full jitter desynchronizes clients, (2) retry budgets cap total attempts, (3) circuit breaker stops retrying when failure rate exceeds threshold. In a multi-tier system, decrease retry count at each tier: 3 at the edge, 2 at middleware, 1 at the backend." |
| "What's the difference between backoff and jitter?" | "Backoff increases the delay between retries. Jitter randomizes that delay so different clients retry at different times. Without jitter, all clients that failed at the same moment follow identical backoff schedules and retry in lockstep. Jitter breaks the synchronization." |
| "How do retries interact with timeouts?" | "Each retry attempt has its own timeout. A 3-second timeout with 3 retries means worst-case 9 seconds of waiting. The total retry duration (including backoff waits) must fit within the caller's SLA. If my SLA is 5 seconds, I can't do 3 retries with 3-second timeouts." |
| "Should the load balancer retry or the application?" | "Both, but at different layers. The load balancer retries on connection failures (server unreachable). The application retries on logical failures (5xx response with specific error). The key: the load balancer should only retry on a different backend instance, not the same one." |
| "How do you make a POST endpoint safe to retry?" | "Idempotency keys. The client generates a UUID before the first attempt and includes it in every retry. The server stores the result keyed by UUID and returns the cached result for duplicates. The key must be generated before attempt 1, not per attempt." |
Test Your Understanding
Quick Recap
- Naive retries (immediate, uniform, unlimited) amplify failures into thundering herds that can take down a recovering service.
- Exponential backoff (`base * 2^attempt`, capped) spaces retries out, but all clients still retry in lockstep without jitter.
- Full jitter (`random(0, backoff)`) is the recommended default; it desynchronizes clients and reduces total retry work by 4-5x.
- Retry amplification in multi-tier systems is multiplicative: three retrying layers with 3 attempts each = 27x load on the innermost service.
- Never retry 4xx errors (permanent failures) or non-idempotent writes without idempotency keys.
- Pair retries with circuit breakers: retry absorbs transient blips, circuit breaker stops retrying during sustained outages.
- Retry budgets (max attempts, max duration, rate limits) contain the blast radius when retries aren't helping.
- The `Retry-After` header from the server should always take precedence over your calculated backoff.
Related Patterns
- Circuit breaker: Stops retrying when the downstream is genuinely broken. Retry handles transient failures; circuit breaker handles sustained failures. They're complementary.
- Dead-letter queue: Where messages go after retries are exhausted. Retry is the first line of defense; DLQ is the safety net.
- Bulkhead: Isolates retry load so that retries for one downstream service don't consume resources needed by other services.
- Rate limiting: The server-side counterpart. When a server rate-limits you (429), respect the `Retry-After` header and back off. Rate limiting protects the server; retry with backoff protects the client.