Job scheduler

The Problem

Your backend runs dozens of background tasks: sending reminder emails 24 hours before a booking, retrying failed payment captures every 5 minutes, aggregating analytics at midnight, and cleaning up expired sessions every hour. Today each task uses its own Thread.sleep loop, scattered across services, with no retry logic, no visibility into failures, and no way to cancel a running job.

An in-process job scheduler centralizes all of this. It accepts jobs with different trigger types (run once at a specific time, repeat on a fixed interval, fire on a cron expression), executes them in a managed thread pool, retries failures with exponential backoff, tracks each execution through a state machine, and shuts down gracefully without dropping in-flight work.

Design the core classes for a job scheduler that supports one-shot and recurring triggers, priority-based execution ordering, configurable thread-pool concurrency, retry with exponential backoff, a job execution state machine, and graceful shutdown.

Requirements

Clarifying Questions

Before jumping into class design, ask questions to turn the vague prompt into a concrete specification. Cover four areas: core actions, error handling, boundaries, and future extensions.

You: "What types of scheduling should we support? One-shot at a specific time, fixed-rate recurring, fixed-delay recurring, or cron expressions?"

Interviewer: "All four. One-shot for deferred tasks, fixed-rate and fixed-delay for periodic work, and cron for calendar-based schedules like 'every weekday at 9 AM'."

Four trigger types. That points to a Strategy interface: each trigger computes the next execution time differently.
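
A minimal sketch of what that Strategy interface could look like, showing only the one-shot and fixed-rate cases. The class names OneShotTrigger and FixedRateTrigger, the Optional return type, and the convention that the caller passes the previous scheduled fire time are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical Strategy interface: each trigger only answers "when next?".
interface Trigger {
    // Next fire time strictly after `after`, or empty if the trigger is exhausted.
    Optional<Instant> nextFireTime(Instant after);
}

// Fires once at a fixed instant, then never again.
final class OneShotTrigger implements Trigger {
    private final Instant fireAt;

    OneShotTrigger(Instant fireAt) {
        this.fireAt = fireAt;
    }

    @Override
    public Optional<Instant> nextFireTime(Instant after) {
        return fireAt.isAfter(after) ? Optional.of(fireAt) : Optional.empty();
    }
}

// Fires every `interval`, measured from the previous scheduled fire time
// (assumption: the caller passes that scheduled time as `after`).
final class FixedRateTrigger implements Trigger {
    private final Duration interval;

    FixedRateTrigger(Duration interval) {
        this.interval = interval;
    }

    @Override
    public Optional<Instant> nextFireTime(Instant after) {
        return Optional.of(after.plus(interval));
    }
}
```

An empty result is one way to tell the scheduler that a one-shot job has nothing left to enqueue; returning null would work too, but forces every caller to remember the check.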

You: "Should jobs have priorities? For example, a payment retry is more urgent than a log cleanup."

Interviewer: "Yes. When multiple jobs are due at the same instant, higher-priority jobs run first."

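Priority breaks ties in the ordering. A min-heap keyed on (nextRunTime, priority) handles this naturally, provided the priority component is compared so that the higher-priority job sorts first.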

You: "What should happen when a job fails? Retry immediately, retry with backoff, or just mark it failed?"

Interviewer: "Retry with exponential backoff. Each job configures its own max retries and base delay. After exhausting retries, mark the job as dead but still schedule the next recurring execution if applicable."

Per-job retry config. The state machine needs RETRYING and DEAD states. A dead one-shot stays dead; a dead recurring job resets for the next scheduled window.

You: "Is there a limit on how many jobs can run concurrently?"

Interviewer: "Yes. The scheduler has a fixed-size thread pool. When all threads are busy, due jobs wait in the queue until a thread becomes available."

Fixed thread pool. We use an ExecutorService with a configurable pool size.

You: "Should the scheduler support job dependencies, where job B runs only after job A completes?"

Interviewer: "Not in the core design. Mention it as an extension."

Good. No DAG execution for now. Each job is independent.

You: "How should the scheduler shut down? Kill running jobs immediately, or wait for them to finish?"

Interviewer: "Graceful shutdown. Stop accepting new jobs, let running jobs complete up to a configurable timeout, then force-stop anything still running."

Two-phase shutdown: soft stop, then hard stop after timeout. The scheduler needs a lifecycle state of its own.
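
The two phases map directly onto the standard ExecutorService calls. The sketch below is one possible shape for the worker pool, not a finished implementation; the method names submit and shutdown and the boolean return value are assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a fixed-size pool with soft-then-hard shutdown.
final class WorkerPool {
    private final ExecutorService executor;

    WorkerPool(int poolSize) {
        this.executor = Executors.newFixedThreadPool(poolSize);
    }

    // Throws RejectedExecutionException once shutdown has begun.
    void submit(Runnable task) {
        executor.execute(task);
    }

    // Returns true if all tasks finished within the timeout, false if a hard stop was needed.
    boolean shutdown(long timeout, TimeUnit unit) {
        executor.shutdown(); // phase 1: reject new tasks, let submitted ones finish
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        executor.shutdownNow(); // phase 2: interrupt running tasks, drop queued ones
        return false;
    }
}
```

Note that shutdownNow only interrupts worker threads. A job that ignores interruption keeps running, so the hard stop is best-effort.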

You: "Do we need persistent storage for jobs, or is in-memory enough?"

Interviewer: "In-memory only. Persistence is an extension."

Perfect. You have now clarified scope and ruled out unnecessary complexity.

Final Requirements

Functional Requirements:

schedule(job, trigger, config) adds a job to the scheduler and returns a handle for cancellation
Support four trigger types: one-shot, fixed-rate, fixed-delay, and cron expression
Execute due jobs in a fixed-size thread pool, ordered by next-run-time and priority
Retry failed jobs with exponential backoff up to a per-job max-retry limit
Track each execution through states: PENDING, RUNNING, COMPLETED, FAILED, RETRYING, DEAD
Cancel a scheduled job via its handle; cancelled jobs skip future executions
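
The six states listed above can be modelled as an enum that knows its own legal transitions. The transition table here is an assumption inferred from the requirements (for example, that RETRYING goes back to RUNNING and that only FAILED can become DEAD); adjust it to whatever you agree with the interviewer.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of the execution state machine; the allowed transitions are an assumption.
enum JobState {
    PENDING, RUNNING, COMPLETED, FAILED, RETRYING, DEAD;

    // States this state may legally move to.
    Set<JobState> allowedNext() {
        switch (this) {
            case PENDING:  return EnumSet.of(RUNNING);
            case RUNNING:  return EnumSet.of(COMPLETED, FAILED);
            case FAILED:   return EnumSet.of(RETRYING, DEAD);
            case RETRYING: return EnumSet.of(RUNNING);
            default:       return EnumSet.noneOf(JobState.class); // COMPLETED and DEAD are terminal
        }
    }

    // Returns the new state, or throws if the transition is not allowed.
    JobState transitionTo(JobState next) {
        if (!allowedNext().contains(next)) {
            throw new IllegalStateException("Illegal transition: " + this + " -> " + next);
        }
        return next;
    }
}
```

Rejecting illegal transitions early turns subtle concurrency bugs, such as completing a job twice, into immediate exceptions.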

Non-Functional Requirements:

Thread-safe: concurrent schedule, cancel, and execution calls
Graceful shutdown with configurable timeout
Extensible: adding a new trigger type requires one class, no changes to existing code

Out of Scope:

Persistent job store (extension)
Job dependency graphs / DAG execution (extension)
Distributed scheduling / leader election (extension)
UI dashboard or REST API

Interview tip

Numbering your requirements on the whiteboard makes it easy to reference them later: "This class satisfies requirements 3 and 5." Interviewers love traceable design.

Example Inputs and Outputs

Scenario 1: One-shot reminder email

Input: Schedule sendReminder to run once at 2026-04-10T09:00:00Z with priority 5
Expected: Job enters PENDING. At 09:00 the scheduler picks it up, moves to RUNNING, executes, and transitions to COMPLETED. No reschedule.
Why: Validates one-shot trigger and basic state machine flow

Scenario 2: Fixed-rate metrics aggregation with failure

Input: Schedule aggregateMetrics every 60 seconds, max 3 retries, base delay 2s
Expected: First execution at T+0 succeeds (COMPLETED). At T+60 it fails. State goes FAILED, then RETRYING. Retries fire at T+62, T+66, and T+74 (exponential backoff of 2s, 4s, 8s, each measured from the previous failure and assuming each attempt fails almost instantly). If all retries fail, state becomes DEAD. At T+120 the next recurring execution starts fresh with PENDING.
Why: Validates retry with backoff, DEAD state, and recurring reschedule after failure
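
The retry times in Scenario 2 follow from doubling the base delay on each retry. A minimal sketch of that calculation, assuming retries are numbered from 1 and that the helper name Backoff.delayFor is hypothetical:

```java
import java.time.Duration;

// Sketch of exponential backoff: delay = baseDelay * 2^(retry - 1).
final class Backoff {
    private Backoff() {}

    // `retry` is 1-based: retry 1 waits baseDelay, retry 2 waits 2 * baseDelay, and so on.
    // Very large retry counts are not guarded against overflow in this sketch.
    static Duration delayFor(Duration baseDelay, int retry) {
        if (retry < 1) {
            throw new IllegalArgumentException("retry must be >= 1");
        }
        return baseDelay.multipliedBy(1L << (retry - 1));
    }
}
```

With a base delay of 2s, retries 1 through 3 wait 2s, 4s, and 8s. Starting from the failure at T+60, that gives T+62, T+66, and T+74.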

Scenario 3: Cron-based cleanup with cancellation

Input: Schedule cleanExpiredSessions with cron 0 * * * * (every hour on the hour), priority 1
Expected: At the next hour boundary the job runs. After completion, the scheduler computes the following hour boundary and re-enqueues. Cancelling the handle stops all future executions.
Why: Validates cron trigger computation and cancellation
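
For the specific expression in Scenario 3, "the next hour boundary" can be computed without a cron parser. This is only a sketch of that one case, evaluated in UTC; a real cron trigger would parse the expression and honour a time zone. The class name HourlyBoundary is hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Sketch for the single cron expression "0 * * * *" (top of every hour), in UTC.
final class HourlyBoundary {
    private HourlyBoundary() {}

    // Returns the first hour boundary strictly after `after`.
    static Instant nextFireTime(Instant after) {
        return after.truncatedTo(ChronoUnit.HOURS).plus(1, ChronoUnit.HOURS);
    }
}
```

Because the result is strictly after the input, passing the time of the run that just completed yields the following hour, which is what the scheduler needs when it re-enqueues the job.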

Try It Yourself

Before reading the solution, spend 15-20 minutes sketching your own design. Think about: what entity computes the next execution time? How does the scheduler know which job to run next? What data structure lets you insert a job and remove the soonest-due one in O(log n)? Compare your approach with the walkthrough below.

Step 1: Identify Core Entities

Start by asking: what are the main "things" in this problem? Scan your requirements for nouns and responsibilities.

"Schedule a job" gives us Job (the work) and JobScheduler (the orchestrator). "Four trigger types" gives us Trigger (the scheduling strategy). "Retry with backoff" needs per-job configuration, so JobConfig wraps retry settings and priority. "Track execution states" gives us JobExecution (one run of a job with its state). "Thread pool" gives us WorkerPool. And schedule() returns a JobHandle for cancellation.

A common mistake is putting everything in one giant Scheduler class. That violates the Single Responsibility Principle (SRP) because scheduling logic, execution logic, retry logic, and trigger computation are four separate concerns.

Entity	Responsibility	Key attributes
Job	The unit of work. A functional interface wrapping the task to execute.	`execute()`
Trigger	Computes the next execution time. One implementation per trigger type (Strategy).	`nextFireTime(after)`
JobConfig	Groups a job with its trigger, retry policy, and priority. Immutable value object.	jobId, job, trigger, maxRetries, baseDelay, priority
JobExecution	Represents one scheduled run of a job. Sits in the priority queue. Tracks state.	config, runAt, attemptNumber, state
JobHandle	Token returned at schedule time. Lets the caller cancel future executions.	jobId, cancelled flag
WorkerPool	Manages the fixed-size thread pool. Submits jobs and supports graceful shutdown.	executor, pool size
JobScheduler	The orchestrator. Owns the priority queue, dispatcher thread, and worker pool.	queue, handles, dispatcher

Notice that Trigger is separate from JobConfig. A trigger only knows how to compute "when next?" while JobConfig bundles the trigger with retry and priority settings. This separation means you can reuse triggers across different jobs and test them independently.
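
To make the separation concrete, here is a sketch of JobConfig as an immutable value object that holds a Trigger instead of implementing one. The field names come from the entity table; the constructor validation, and the exact shapes of the Job and Trigger interfaces (repeated so the block compiles on its own), are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;
import java.util.Optional;

// The unit of work (assumed shape).
@FunctionalInterface
interface Job {
    void execute() throws Exception;
}

// The scheduling strategy (assumed shape).
@FunctionalInterface
interface Trigger {
    Optional<Instant> nextFireTime(Instant after);
}

// Immutable bundle of a job, its trigger, and its retry and priority settings.
final class JobConfig {
    final String jobId;
    final Job job;
    final Trigger trigger;
    final int maxRetries;
    final Duration baseDelay;
    final int priority;

    JobConfig(String jobId, Job job, Trigger trigger,
              int maxRetries, Duration baseDelay, int priority) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.jobId = Objects.requireNonNull(jobId, "jobId");
        this.job = Objects.requireNonNull(job, "job");
        this.trigger = Objects.requireNonNull(trigger, "trigger");
        this.maxRetries = maxRetries;
        this.baseDelay = Objects.requireNonNull(baseDelay, "baseDelay");
        this.priority = priority;
    }
}
```

Because a trigger in this sketch is stateless, a single instance such as `after -> Optional.of(after.plusSeconds(60))` can be passed to any number of JobConfig objects, each with its own retry policy and priority.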

Step 2: Define Relationships and Class Design

Class Diagram

Class Interface Derivation

JobScheduler (the orchestrator)

The scheduler is the central coordinator. It owns the run queue, the dispatcher loop, and the worker pool.

Deriving state from requirements:

Requirement	What JobScheduler must track
"Schedule a job and return a handle"	Map of jobId to JobHandle
"Execute due jobs ordered by time and priority"	PriorityQueue of JobExecution
"Fixed-size thread pool"	WorkerPool reference
"Graceful shutdown"	Running flag, shutdown state

Deriving methods from needs:

Need from requirements	Method
"Add a job to the scheduler"	`schedule(config): JobHandle`
"Run the dispatch loop"	`start(): void` (internal)
"Stop accepting and drain"	`shutdown(timeout): void`

JobExecution (the queue entry)

Each execution sits in the priority queue. It must be comparable so the heap orders by (runAt, priority).
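
A minimal sketch of that ordering, reduced to the fields the comparison needs. It assumes a larger priority number means more urgent, so priority is compared in reverse; if your convention is "lower number wins", drop the reversed() call.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.PriorityQueue;

// Simplified queue entry: only the fields used for ordering are shown.
final class JobExecution implements Comparable<JobExecution> {
    // Earliest runAt first; on a tie, the larger priority number first (assumption).
    private static final Comparator<JobExecution> ORDER =
            Comparator.comparing((JobExecution e) -> e.runAt)
                    .thenComparing(
                            Comparator.comparingInt((JobExecution e) -> e.priority).reversed());

    final String jobId;
    final Instant runAt;
    final int priority;

    JobExecution(String jobId, Instant runAt, int priority) {
        this.jobId = jobId;
        this.runAt = runAt;
        this.priority = priority;
    }

    @Override
    public int compareTo(JobExecution other) {
        return ORDER.compare(this, other);
    }
}

final class QueueDemo {
    public static void main(String[] args) {
        Instant t0 = Instant.parse("2026-04-10T09:00:00Z");
        PriorityQueue<JobExecution> queue = new PriorityQueue<>();
        queue.add(new JobExecution("logCleanup", t0, 1));
        queue.add(new JobExecution("paymentRetry", t0, 5));
        queue.add(new JobExecution("earlyJob", t0.minusSeconds(1), 1));
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().jobId);
        }
    }
}
```

java.util.PriorityQueue is not thread-safe, so a scheduler that allows concurrent schedule and cancel calls would need to guard it with a lock or use PriorityBlockingQueue instead.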