Retry library
Low-level design of a configurable retry library -- retry policies (fixed, exponential backoff, jitter), retryable vs non-retryable exceptions, max attempts, timeout budget, and circuit-breaker integration.
The Problem
Your payment service calls a third-party gateway that returns HTTP 503 about 2% of the time. The on-call engineer wraps every call site in a hand-rolled for loop with Thread.sleep(1000) and a hardcoded retry count of 3. Six months later, 14 different services have 14 different retry loops, each with its own bugs: some forget to filter non-retryable errors, some retry forever on 400 Bad Request, and none of them adds jitter, so every service retries at the same instant after a blip.
That synchronized retry storm is called the thundering herd problem. A retry library encapsulates the loop once. The caller hands it a Callable and a RetryConfig that specifies max attempts, backoff strategy, which exceptions are retryable, and an overall timeout budget. The library handles the sleeping, the exception filtering, and the event callbacks for observability.
Design the core classes for a retry library that supports pluggable backoff strategies (fixed, exponential, exponential with jitter), retryable vs non-retryable exception filtering, a fluent builder API, an overall timeout budget, and listener hooks for observability.
Requirements
Clarifying Questions
Before jumping into class design, ask questions to pin down the scope. Cover four areas: core behavior, error handling, boundaries, and extensibility.
You: "Is the retry synchronous (blocking the caller's thread) or asynchronous with a callback/future?"
Interviewer: "Start with synchronous. Design it so we could add async support later without rewriting the core loop."
Good. Synchronous keeps the first pass simple. We will isolate the sleep mechanism behind an interface so async becomes a swap-in later.
You: "Which backoff strategies do we need? Fixed delay only, or exponential too?"
Interviewer: "Support fixed delay, exponential backoff, and exponential backoff with jitter. Make it pluggable so adding a new strategy is a one-class change."
Three built-in strategies behind one interface. That is the Strategy pattern. Adding a fourth strategy (linear, decorrelated jitter) requires zero changes to the retry loop.
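As a rough illustration of the three strategies behind one interface, here is a minimal sketch. The method name delayBefore, the maxDelay cap, and the choice of "full jitter" (a uniform random delay between zero and the exponential delay) are assumptions made for this example; the requirements do not say which jitter variant is expected.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Computes the wait before a retry. retryNumber starts at 1 (hypothetical convention). */
interface BackoffStrategy {
    Duration delayBefore(int retryNumber);
}

/** Same delay before every retry. */
final class FixedBackoff implements BackoffStrategy {
    private final Duration delay;

    FixedBackoff(Duration delay) { this.delay = delay; }

    @Override public Duration delayBefore(int retryNumber) { return delay; }
}

/** base * multiplier^(retryNumber - 1), capped at maxDelay (the cap is an assumption). */
final class ExponentialBackoff implements BackoffStrategy {
    private final Duration base;
    private final double multiplier;
    private final Duration maxDelay;

    ExponentialBackoff(Duration base, double multiplier, Duration maxDelay) {
        this.base = base;
        this.multiplier = multiplier;
        this.maxDelay = maxDelay;
    }

    @Override public Duration delayBefore(int retryNumber) {
        double millis = base.toMillis() * Math.pow(multiplier, retryNumber - 1);
        return Duration.ofMillis((long) Math.min(millis, maxDelay.toMillis()));
    }
}

/** Wraps another strategy and picks a uniform random delay in [0, inner delay]. */
final class JitteredBackoff implements BackoffStrategy {
    private final BackoffStrategy inner;

    JitteredBackoff(BackoffStrategy inner) { this.inner = inner; }

    @Override public Duration delayBefore(int retryNumber) {
        long ceiling = inner.delayBefore(retryNumber).toMillis();
        // nextLong(bound) is exclusive of bound, so add 1 to include the ceiling.
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(ceiling + 1));
    }
}
```

Writing jitter as a decorator means it can wrap any other strategy, so a new strategy such as linear backoff gets a jittered variant without extra code.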
You: "Should the library retry on every exception, or can the caller specify which ones are retryable?"
Interviewer: "The caller configures which exceptions are retryable. If an unregistered exception type is thrown, propagate it immediately without retrying."
Exception filtering matters. An IOException is worth retrying. A ValidationException is not. We need a predicate the caller can configure.
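One way to express that predicate is a type-based filter. This is a sketch only: the names shouldRetry and onAnyOf are hypothetical, and it assumes that a subclass of a registered type should also count as retryable.

```java
import java.util.List;

/** Decides whether a failure is worth another attempt. */
interface RetryPredicate {
    boolean shouldRetry(Throwable failure);

    /** Retryable if the failure is an instance of any listed type (subclasses included). */
    @SafeVarargs
    static RetryPredicate onAnyOf(Class<? extends Throwable>... types) {
        // List.of makes an immutable copy, so the predicate is safe to share across threads.
        List<Class<? extends Throwable>> retryable = List.of(types);
        return failure -> retryable.stream().anyMatch(type -> type.isInstance(failure));
    }
}
```

With RetryPredicate.onAnyOf(IOException.class), an IOException or any of its subclasses is retried, while an unregistered type such as a validation error propagates immediately.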
You: "Is there an overall timeout budget across all attempts, or just a per-attempt timeout?"
Interviewer: "Both. A per-attempt timeout for individual calls and a total timeout budget that caps the entire retry sequence. If the budget expires mid-retry, stop immediately."
Two timeout dimensions. The budget is the wall-clock ceiling for the whole operation. Per-attempt is the ceiling for a single invocation.
You: "Should we notify anyone when a retry happens or when all retries are exhausted?"
Interviewer: "Yes. Provide listener hooks so callers can log each retry attempt, record metrics on success, and alert when retries are exhausted."
Observer pattern for hooks. A RetryListener interface with onRetry, onSuccess, and onExhausted callbacks. The caller registers listeners at config time.
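A possible shape for that interface is sketched below. The callback names come from the discussion above; the parameter lists and the CountingListener class are assumptions added for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Observer hooks. Default no-op bodies let a listener override only what it needs. */
interface RetryListener {
    /** Called after a failed attempt, before sleeping for delayMillis. */
    default void onRetry(int attempt, Throwable cause, long delayMillis) {}

    /** Called once when an attempt finally succeeds. */
    default void onSuccess(int totalAttempts) {}

    /** Called once when the library gives up. */
    default void onExhausted(int totalAttempts, Throwable lastFailure) {}
}

/** Example listener (hypothetical): counts events, e.g. to feed a metrics system. */
final class CountingListener implements RetryListener {
    final AtomicInteger retries = new AtomicInteger();
    final AtomicInteger successes = new AtomicInteger();

    @Override public void onRetry(int attempt, Throwable cause, long delayMillis) {
        retries.incrementAndGet();
    }

    @Override public void onSuccess(int totalAttempts) {
        successes.incrementAndGet();
    }
    // onExhausted is intentionally not overridden; the default no-op applies.
}
```

Atomic counters are used because the same listener instance may be registered on a config that several threads share.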
You: "Does the library need to be thread-safe? Could multiple threads share the same config?"
Interviewer: "Yes. The configuration object is immutable and shared freely. Each execution creates its own local state."
Immutable config, mutable execution state. The RetryConfig is built once via a builder and never modified. Each call to execute() runs in isolation.
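The immutable-config idea can be sketched with a builder like the one below. It is deliberately partial: it carries only max attempts, the timeout budget, and the retryable types, and the default values (3 attempts, 30 seconds) are assumptions for this example, not values from the requirements.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Immutable once built, so one instance can be shared freely across threads. */
final class RetryConfig {
    private final int maxAttempts;
    private final Duration timeoutBudget;
    private final List<Class<? extends Throwable>> retryableTypes;

    private RetryConfig(Builder b) {
        this.maxAttempts = b.maxAttempts;
        this.timeoutBudget = b.timeoutBudget;
        // Copy so later changes to the builder cannot leak into this config.
        this.retryableTypes = List.copyOf(b.retryableTypes);
    }

    static Builder builder() { return new Builder(); }

    int maxAttempts() { return maxAttempts; }
    Duration timeoutBudget() { return timeoutBudget; }
    List<Class<? extends Throwable>> retryableTypes() { return retryableTypes; }

    /** Mutable and single-threaded by design; only the built config is shared. */
    static final class Builder {
        private int maxAttempts = 3;                           // assumed default
        private Duration timeoutBudget = Duration.ofSeconds(30); // assumed default
        private final List<Class<? extends Throwable>> retryableTypes = new ArrayList<>();

        Builder maxAttempts(int n) {
            if (n < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
            this.maxAttempts = n;
            return this;
        }

        Builder timeoutBudget(Duration budget) {
            if (budget.isNegative() || budget.isZero()) {
                throw new IllegalArgumentException("timeoutBudget must be positive");
            }
            this.timeoutBudget = budget;
            return this;
        }

        Builder retryOn(Class<? extends Throwable> type) {
            retryableTypes.add(type);
            return this;
        }

        RetryConfig build() { return new RetryConfig(this); }
    }
}
```

A call such as RetryConfig.builder().maxAttempts(4).retryOn(IOException.class).build() reads fluently, and validation happens at build time rather than in the middle of a retry loop.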
You: "Should we integrate with a circuit breaker, or is that separate?"
Interviewer: "Keep it separate for now. Design it so a circuit breaker can wrap the retry logic without code changes."
Clean boundary. The retry library focuses on retrying. A circuit breaker sits outside and decides whether to even attempt the call.
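To make the boundary concrete, here is a hedged sketch of a breaker that wraps any Callable, including one that retries internally. SimpleCircuitBreaker is a hypothetical, deliberately naive class (it opens after a number of consecutive failures and has no half-open or reset-timer state); it is not part of the retry library described here.

```java
import java.util.concurrent.Callable;

/** Hypothetical minimal breaker: opens after N consecutive failures, never half-opens. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private int consecutiveFailures;

    SimpleCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    synchronized boolean allowsCall() { return consecutiveFailures < failureThreshold; }
    synchronized void recordSuccess() { consecutiveFailures = 0; }
    synchronized void recordFailure() { consecutiveFailures++; }

    /** Decorates a call. The wrapped call may itself contain a full retry sequence. */
    <T> Callable<T> protect(Callable<T> call) {
        return () -> {
            if (!allowsCall()) {
                // Fail fast: the wrapped call (and its retries) is never started.
                throw new IllegalStateException("circuit open");
            }
            try {
                T result = call.call();
                recordSuccess();
                return result;
            } catch (Exception e) {
                recordFailure(); // one exhausted retry sequence counts as one failure
                throw e;
            }
        };
    }
}
```

Because the breaker only sees a Callable, the retry code needs no knowledge of it, which is the "no code changes" property the interviewer asked for.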
Final Requirements
Functional Requirements:
- Execute a Callable<T> with automatic retry on failure.
- Support pluggable backoff strategies: fixed delay, exponential, and exponential with jitter.
- Allow the caller to specify which exception types are retryable.
- Enforce a maximum number of retry attempts.
- Enforce a total timeout budget across all attempts.
- Notify registered listeners on retry, success, and exhaustion events.
Non-Functional Requirements:
- Thread safety: immutable config shared across threads, mutable state per execution.
- Extensibility: adding a new backoff strategy requires one new class.
- Testability: the sleep mechanism is injectable so unit tests run without real delays.
Out of Scope:
- Async/non-blocking execution (designed for, not implemented).
- Circuit breaker integration (compose externally).
- Distributed coordination (this is an in-process library).
- Persistence of retry history.
Interview tip
State these requirements out loud. Numbering them gives the interviewer anchors to reference: "Let's look at requirement 3, exception filtering." It signals that you think in structured specs, not stream-of-consciousness.
Example Inputs and Outputs
Scenario 1: Transient failure, then success
- Input: Call a payment gateway. First two attempts throw IOException. Third attempt returns PaymentResult(OK).
- Config: maxAttempts=4, backoff=exponential(100ms, multiplier 2.0), retryOn=IOException.
- Expected: Returns PaymentResult(OK) after 3 attempts. Listener receives 2 onRetry callbacks then 1 onSuccess.
- Why: Validates the core retry loop (requirements 1, 2, 6).
Scenario 2: Non-retryable exception
- Input: Call a validation endpoint. First attempt throws ValidationException.
- Config: maxAttempts=4, retryOn=IOException.
- Expected: ValidationException propagates immediately. No retry. Listener receives nothing.
- Why: Validates exception filtering (requirement 3).
Scenario 3: Timeout budget exceeded
- Input: Call a slow service. Each attempt takes 600ms and fails. Budget is 1500ms.
- Config: maxAttempts=5, backoff=fixed(200ms), timeoutBudget=1500ms.
- Expected: Completes 2 attempts (600ms + 200ms + 600ms = 1400ms), then stops before attempt 3 because the budget would be exceeded. Throws TimeoutBudgetExceededException. Listener receives onExhausted.
- Why: Validates total timeout budget enforcement (requirement 5).
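The arithmetic in Scenario 3 can be checked with a small simulation. It assumes one particular budget rule, which the scenario implies but does not state outright: after a failed attempt, stop if the elapsed time plus the next backoff delay would exceed the budget. The class and method names are hypothetical.

```java
/** Simulates elapsed time for attempts that always fail; no real sleeping happens. */
final class BudgetTimeline {
    static int attemptsWithinBudget(long attemptMs, long backoffMs, long budgetMs, int maxAttempts) {
        long elapsedMs = 0;
        int attempts = 0;
        while (attempts < maxAttempts) {
            elapsedMs += attemptMs;              // the attempt runs and fails
            attempts++;
            if (attempts == maxAttempts) break;  // attempt limit reached first
            if (elapsedMs + backoffMs > budgetMs) break; // sleeping would overrun the budget
            elapsedMs += backoffMs;              // back off before the next attempt
        }
        return attempts;
    }
}
```

With the scenario's values (600ms attempts, 200ms backoff, 1500ms budget, 5 max attempts) the loop reaches 1400ms after attempt 2, sees that 1400 + 200 exceeds 1500, and stops at 2 attempts.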
Try It Yourself
Before reading the solution, spend 10 minutes sketching the core entities. Think about what varies (backoff calculation, exception filtering) and what stays the same (retry loop, attempt counting). That distinction points directly at the patterns you need. Compare your approach with the walkthrough below.
Step 1: Identify Core Entities
Start by asking: what are the main "things" in this problem? Look for nouns in the requirements: retry, config, backoff, exception filter, timeout, listener. Each noun suggests a class or interface with a single, clear job.
A common mistake is putting all retry logic, backoff calculation, and exception filtering into one giant RetryUtil class. That violates SRP and makes every change risky. Good design means each class does exactly one thing.
| Entity | Responsibility | Key attributes |
|---|---|---|
| Retryer | The orchestrator. Runs the retry loop using the config. | config, sleeper |
| RetryConfig | Immutable configuration. Holds max attempts, backoff, predicate, budget, listeners. | maxAttempts, backoffStrategy, retryPredicate, timeoutBudget, listeners |
| BackoffStrategy | Calculates the delay between attempts. One implementation per algorithm. | (varies by strategy) |
| RetryPredicate | Decides whether a given exception is retryable. | retryableExceptions |
| RetryListener | Observer interface. Receives callbacks on retry, success, and exhaustion. | (callback methods) |
| RetryContext | Per-execution mutable state. Tracks attempt number, elapsed time, last exception. | attemptNumber, startTime, lastException |
| Sleeper | Abstracts Thread.sleep so tests can substitute a fake. | (sleep method) |
Notice we separated RetryConfig from Retryer because config is immutable and shared, while the retryer performs work. We separated BackoffStrategy from RetryConfig because the delay algorithm varies independently from the rest of the configuration. RetryContext is separate from Retryer because each execution needs its own isolated state even when sharing the same retryer instance.
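The separation described above can be sketched as a simplified Retryer. To stay short and self-contained, this version takes its settings as constructor arguments instead of a RetryConfig, uses java.util.function types as stand-ins for BackoffStrategy and RetryPredicate, and omits listeners and the timeout budget. Treat it as an illustration of the loop structure under those assumptions, not a complete implementation.

```java
import java.util.concurrent.Callable;
import java.util.function.IntToLongFunction;
import java.util.function.Predicate;

/** Abstraction over Thread.sleep so tests can substitute a fake. */
interface Sleeper {
    void sleep(long millis) throws InterruptedException;
}

final class Retryer {
    private final int maxAttempts;
    private final IntToLongFunction backoffMillis; // stand-in for BackoffStrategy
    private final Predicate<Throwable> retryable;  // stand-in for RetryPredicate
    private final Sleeper sleeper;

    Retryer(int maxAttempts, IntToLongFunction backoffMillis,
            Predicate<Throwable> retryable, Sleeper sleeper) {
        this.maxAttempts = maxAttempts;
        this.backoffMillis = backoffMillis;
        this.retryable = retryable;
        this.sleeper = sleeper;
    }

    <T> T execute(Callable<T> task) throws Exception {
        // Per-execution state lives in local variables (the role RetryContext plays),
        // so one Retryer instance can serve many threads at once.
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (!retryable.test(e) || attempt >= maxAttempts) {
                    throw e; // non-retryable, or attempts exhausted
                }
                sleeper.sleep(backoffMillis.applyAsLong(attempt));
            }
        }
    }
}
```

In production the Sleeper would be Thread::sleep; in a unit test it can be a lambda that records the requested delays, so the test finishes instantly and can assert on the backoff sequence.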
Step 2: Define Relationships and Class Design
RetryConfig (the immutable value object)
RetryConfig holds everything the retryer needs to make decisions. It is built once and never modified.
Deriving state from requirements: