Job scheduler
Low-level design of an in-process job scheduler -- one-shot and recurring jobs, cron expressions, priority queue scheduling, thread-pool execution, retry with backoff, job state machine, and graceful shutdown.
The Problem
Your backend runs dozens of background tasks: sending reminder emails 24 hours before a booking, retrying failed payment captures every 5 minutes, aggregating analytics at midnight, and cleaning up expired sessions every hour. Today each task uses its own Thread.sleep loop, scattered across services, with no retry logic, no visibility into failures, and no way to cancel a running job.
An in-process job scheduler centralizes all of this. It accepts jobs with different trigger types (run once at a specific time, repeat on a fixed interval, fire on a cron expression), executes them in a managed thread pool, retries failures with exponential backoff, tracks each execution through a state machine, and shuts down gracefully without dropping in-flight work.
Design the core classes for a job scheduler that supports one-shot and recurring triggers, priority-based execution ordering, configurable thread-pool concurrency, retry with exponential backoff, a job execution state machine, and graceful shutdown.
Requirements
Clarifying Questions
Before jumping into class design, ask questions to turn the vague prompt into a concrete specification. Cover four areas: core actions, error handling, boundaries, and future extensions.
You: "What types of scheduling should we support? One-shot at a specific time, fixed-rate recurring, fixed-delay recurring, or cron expressions?"
Interviewer: "All four. One-shot for deferred tasks, fixed-rate and fixed-delay for periodic work, and cron for calendar-based schedules like 'every weekday at 9 AM'."
Four trigger types. That points to a Strategy interface: each trigger computes the next execution time differently.
You: "Should jobs have priorities? For example, a payment retry is more urgent than a log cleanup."
Interviewer: "Yes. When multiple jobs are due at the same instant, higher-priority jobs run first."
Priority affects ordering. A min-heap ordered by nextRunTime, with priority as the tie-breaker, handles this naturally.
You: "What should happen when a job fails? Retry immediately, retry with backoff, or just mark it failed?"
Interviewer: "Retry with exponential backoff. Each job configures its own max retries and base delay. After exhausting retries, mark the job as dead but still schedule the next recurring execution if applicable."
Per-job retry config. The state machine needs RETRYING and DEAD states. A dead one-shot stays dead; a dead recurring job resets for the next scheduled window.
You: "Is there a limit on how many jobs can run concurrently?"
Interviewer: "Yes. The scheduler has a fixed-size thread pool. When all threads are busy, due jobs wait in the queue until a thread becomes available."
Fixed thread pool. We use an ExecutorService with a configurable pool size.
You: "Should the scheduler support job dependencies, where job B runs only after job A completes?"
Interviewer: "Not in the core design. Mention it as an extension."
Good. No DAG execution for now. Each job is independent.
You: "How should the scheduler shut down? Kill running jobs immediately, or wait for them to finish?"
Interviewer: "Graceful shutdown. Stop accepting new jobs, let running jobs complete up to a configurable timeout, then force-stop anything still running."
Two-phase shutdown: soft stop, then hard stop after timeout. The scheduler needs a lifecycle state of its own.
You: "Do we need persistent storage for jobs, or is in-memory enough?"
Interviewer: "In-memory only. Persistence is an extension."
Perfect. You have now clarified scope and ruled out unnecessary complexity.
Final Requirements
Functional Requirements:
- `schedule(job, trigger, config)` adds a job to the scheduler and returns a handle for cancellation
- Support four trigger types: one-shot, fixed-rate, fixed-delay, and cron expression
- Execute due jobs in a fixed-size thread pool, ordered by next-run-time and priority
- Retry failed jobs with exponential backoff up to a per-job max-retry limit
- Track each execution through states: PENDING, RUNNING, COMPLETED, FAILED, RETRYING, DEAD
- Cancel a scheduled job via its handle; cancelled jobs skip future executions
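The six execution states can be sketched as an enum that knows its own legal transitions. The transition table below is an assumption inferred from the scenarios in this walkthrough (FAILED leads to either RETRYING or DEAD, and a recurring job gets a fresh PENDING execution rather than reviving an old one); `canTransitionTo` is a hypothetical helper name.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum JobState {
    PENDING, RUNNING, COMPLETED, FAILED, RETRYING, DEAD;

    // Assumed transition table: COMPLETED and DEAD are terminal for a single
    // execution. A recurring job is modelled as a new PENDING execution.
    private static final Map<JobState, Set<JobState>> ALLOWED = Map.of(
            PENDING, EnumSet.of(RUNNING),
            RUNNING, EnumSet.of(COMPLETED, FAILED),
            FAILED, EnumSet.of(RETRYING, DEAD),
            RETRYING, EnumSet.of(RUNNING),
            COMPLETED, EnumSet.noneOf(JobState.class),
            DEAD, EnumSet.noneOf(JobState.class));

    /** Returns true if moving from this state to {@code next} is legal. */
    boolean canTransitionTo(JobState next) {
        return ALLOWED.get(this).contains(next);
    }
}
```

Guarding every state change with a check like this turns an illegal jump (for example DEAD back to RUNNING) into an immediate, visible error instead of a silent inconsistency.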
Non-Functional Requirements:
- Thread-safe: concurrent schedule, cancel, and execution calls
- Graceful shutdown with configurable timeout
- Extensible: adding a new trigger type requires one class, no changes to existing code
Out of Scope:
- Persistent job store (extension)
- Job dependency graphs / DAG execution (extension)
- Distributed scheduling / leader election (extension)
- UI dashboard or REST API
Interview tip
Numbering your requirements on the whiteboard makes it easy to reference them later: "This class satisfies requirements 3 and 5." Interviewers love traceable design.
Example Inputs and Outputs
Scenario 1: One-shot reminder email
- Input: Schedule `sendReminder` to run once at `2026-04-10T09:00:00Z` with priority 5
- Expected: Job enters PENDING. At 09:00 the scheduler picks it up, moves to RUNNING, executes, and transitions to COMPLETED. No reschedule.
- Why: Validates one-shot trigger and basic state machine flow
Scenario 2: Fixed-rate metrics aggregation with failure
- Input: Schedule `aggregateMetrics` every 60 seconds, max 3 retries, base delay 2s
- Expected: First execution at T+0 succeeds (COMPLETED). At T+60 it fails. State goes FAILED, then RETRYING. Retries run at T+62, T+66, and T+74 (exponential backoff of 2s, 4s, 8s, each delay measured from the previous failed attempt). If all retries fail, state becomes DEAD. At T+120 the next recurring execution starts fresh with PENDING.
- Why: Validates retry with backoff, DEAD state, and recurring reschedule after failure
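The retry times in Scenario 2 follow from doubling the base delay on each retry. Here is a minimal sketch of that arithmetic; `Backoff`, `delayFor`, and `retrySchedule` are hypothetical names, and the schedule assumes each attempt fails immediately, so the next delay starts from the previous retry time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class Backoff {
    private Backoff() {}

    /** Delay before retry number {@code retry} (1-based): baseDelay * 2^(retry - 1). */
    static Duration delayFor(Duration baseDelay, int retry) {
        if (retry < 1) {
            throw new IllegalArgumentException("retry must be >= 1");
        }
        // Cap the exponent so the shift cannot overflow for very large retry counts.
        long factor = 1L << Math.min(retry - 1, 30);
        return baseDelay.multipliedBy(factor);
    }

    /** Retry timestamps after a first failure, assuming every attempt fails instantly. */
    static List<Instant> retrySchedule(Instant firstFailure, Duration baseDelay, int maxRetries) {
        List<Instant> times = new ArrayList<>();
        Instant t = firstFailure;
        for (int retry = 1; retry <= maxRetries; retry++) {
            t = t.plus(delayFor(baseDelay, retry));
            times.add(t);
        }
        return times;
    }
}
```

With a first failure at T+60, a base delay of 2 seconds, and 3 retries, `retrySchedule` yields T+62, T+66, and T+74, matching the scenario.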
Scenario 3: Cron-based cleanup with cancellation
- Input: Schedule `cleanExpiredSessions` with cron `0 * * * *` (every hour on the hour), priority 1
- Expected: At the next hour boundary the job runs. After completion, the scheduler computes the following hour boundary and re-enqueues. Cancelling the handle stops all future executions.
- Why: Validates cron trigger computation and cancellation
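To make "computes the following hour boundary" concrete, here is a sketch that handles only the `0 * * * *` case from Scenario 3. It is not a general cron parser; `HourlyTrigger` is a hypothetical name, and the computation assumes the schedule is evaluated in UTC.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class HourlyTrigger {
    private HourlyTrigger() {}

    /**
     * Next top-of-the-hour instant strictly after {@code after}.
     * Assumption: hours are aligned to UTC, as in the scenario's Z timestamps.
     */
    static Instant nextFireTime(Instant after) {
        // Drop minutes, seconds, and nanos, then step forward one full hour.
        return after.truncatedTo(ChronoUnit.HOURS).plus(1, ChronoUnit.HOURS);
    }
}
```

Because the result is strictly after the input, calling it with the time a run just fired at (for example exactly 10:00) returns 11:00 rather than re-enqueueing the same boundary.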
Try It Yourself
Before reading the solution, spend 15-20 minutes sketching your own design. Think about: what entity computes the next execution time? How does the scheduler know which job to run next? What data structure gives you the soonest-due job in O(log n)? Compare your approach with the walkthrough below.
Step 1: Identify Core Entities
Start by asking: what are the main "things" in this problem? Scan your requirements for nouns and responsibilities.
"Schedule a job" gives us Job (the work) and JobScheduler (the orchestrator). "Four trigger types" gives us Trigger (the scheduling strategy). "Retry with backoff" needs per-job configuration, so JobConfig wraps retry settings and priority. "Track execution states" gives us JobExecution (one run of a job with its state). "Thread pool" gives us WorkerPool. And schedule() returns a JobHandle for cancellation.
A common mistake is putting everything in one giant Scheduler class. That violates SRP because scheduling logic, execution logic, retry logic, and trigger computation are four separate concerns.
| Entity | Responsibility | Key attributes |
|---|---|---|
| Job | The unit of work. A functional interface wrapping the task to execute. | execute() |
| Trigger | Computes the next execution time. One implementation per trigger type (Strategy). | nextFireTime(after) |
| JobConfig | Groups a job with its trigger, retry policy, and priority. Immutable value object. | jobId, job, trigger, maxRetries, baseDelay, priority |
| JobExecution | Represents one scheduled run of a job. Sits in the priority queue. Tracks state. | config, runAt, attemptNumber, state |
| JobHandle | Token returned at schedule time. Lets the caller cancel future executions. | jobId, cancelled flag |
| WorkerPool | Manages the fixed-size thread pool. Submits jobs and supports graceful shutdown. | executor, pool size |
| JobScheduler | The orchestrator. Owns the priority queue, dispatcher thread, and worker pool. | queue, handles, dispatcher |
Notice that Trigger is separate from JobConfig. A trigger only knows how to compute "when next?" while JobConfig bundles the trigger with retry and priority settings. This separation means you can reuse triggers across different jobs and test them independently.
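One way to sketch the Trigger strategy is an interface with a single `nextFireTime(after)` method and one small class per trigger type. The `Optional<Instant>` return type (empty meaning "never fires again") and the constructor parameters are assumptions for illustration; only two of the four trigger types are shown.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

interface Trigger {
    /** Next fire time strictly after {@code after}, or empty if the trigger is exhausted. */
    Optional<Instant> nextFireTime(Instant after);
}

/** Fires exactly once at a fixed instant. */
final class OneShotTrigger implements Trigger {
    private final Instant runAt;

    OneShotTrigger(Instant runAt) {
        this.runAt = runAt;
    }

    @Override
    public Optional<Instant> nextFireTime(Instant after) {
        return runAt.isAfter(after) ? Optional.of(runAt) : Optional.empty();
    }
}

/** Fires at start, start + period, start + 2 * period, ... regardless of job duration. */
final class FixedRateTrigger implements Trigger {
    private final Instant start;
    private final Duration period;

    FixedRateTrigger(Instant start, Duration period) {
        if (period.isZero() || period.isNegative()) {
            throw new IllegalArgumentException("period must be positive");
        }
        this.start = start;
        this.period = period;
    }

    @Override
    public Optional<Instant> nextFireTime(Instant after) {
        if (after.isBefore(start)) {
            return Optional.of(start);
        }
        // Number of whole periods already elapsed, plus one to move strictly past 'after'.
        long elapsedPeriods = Duration.between(start, after).toMillis() / period.toMillis();
        return Optional.of(start.plus(period.multipliedBy(elapsedPeriods + 1)));
    }
}
```

Each trigger is a pure function of time, so you can unit-test it with fixed instants and no scheduler, threads, or clock mocking.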
Step 2: Define Relationships and Class Design
Class Diagram
Class Interface Derivation
JobScheduler (the orchestrator)
The scheduler is the central coordinator. It owns the run queue, the dispatcher loop, and the worker pool.
Deriving state from requirements:
| Requirement | What JobScheduler must track |
|---|---|
| "Schedule a job and return a handle" | Map of jobId to JobHandle |
| "Execute due jobs ordered by time and priority" | PriorityQueue of JobExecution |
| "Fixed-size thread pool" | WorkerPool reference |
| "Graceful shutdown" | Running flag, shutdown state |
Deriving methods from needs:
| Need from requirements | Method |
|---|---|
| "Add a job to the scheduler" | schedule(config): JobHandle |
| "Run the dispatch loop" | start(): void (internal) |
| "Stop accepting and drain" | shutdown(timeout): void |
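The two-phase shutdown from the requirements maps closely onto the standard `ExecutorService` lifecycle calls. Below is a minimal sketch of a WorkerPool built on that; the method names and the boolean return value are assumptions, not a fixed API.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class WorkerPool {
    private final ExecutorService executor;

    WorkerPool(int poolSize) {
        this.executor = Executors.newFixedThreadPool(poolSize);
    }

    /** Submits a task; throws RejectedExecutionException once shutdown has begun. */
    void submit(Runnable task) {
        executor.execute(task);
    }

    /**
     * Phase 1: stop accepting work and wait up to {@code timeout} for running tasks.
     * Phase 2: interrupt anything still running.
     * Returns true if every task finished within the timeout.
     */
    boolean shutdown(Duration timeout) {
        executor.shutdown(); // soft stop: already-submitted tasks keep running
        try {
            if (executor.awaitTermination(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                return true;
            }
            executor.shutdownNow(); // hard stop: interrupts worker threads
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
    }
}
```

Note that `shutdownNow()` can only interrupt threads. A job that ignores interruption will keep running, so long-running jobs should check `Thread.currentThread().isInterrupted()` or use interruptible blocking calls.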
JobExecution (the queue entry)
Each execution sits in the priority queue. It must be comparable so the heap orders by (runAt, priority).
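The (runAt, priority) ordering can be sketched with a `compareTo` that compares run times first and falls back to priority on ties. This version is deliberately simplified: it carries only `jobId`, `runAt`, and `priority` instead of the full config, and it assumes a larger number means higher priority, which the requirements do not state explicitly.

```java
import java.time.Instant;

final class JobExecution implements Comparable<JobExecution> {
    final String jobId;
    final Instant runAt;
    final int priority; // assumption: larger value = more urgent

    JobExecution(String jobId, Instant runAt, int priority) {
        this.jobId = jobId;
        this.runAt = runAt;
        this.priority = priority;
    }

    @Override
    public int compareTo(JobExecution other) {
        // Earlier run time always wins.
        int byTime = runAt.compareTo(other.runAt);
        if (byTime != 0) {
            return byTime;
        }
        // Same instant: reverse the comparison so the higher priority sorts first.
        return Integer.compare(other.priority, priority);
    }
}
```

Placed in a `java.util.PriorityQueue<JobExecution>`, the head of the queue is then always the soonest-due execution, and among executions due at the same instant, the most urgent one.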
Deriving state from requirements: