Job Scheduler
Design a distributed job scheduler that executes millions of cron-like and one-off jobs reliably across a worker fleet, covering scheduling algorithms, exactly-once execution, failure recovery, and priority queuing.
What is a job scheduler?
A job scheduler fires code at a predefined time or on a recurring schedule. Submit a job definition (a cron expression or a one-off timestamp plus a handler endpoint) and the scheduler executes it at the right moment. The interesting engineering problems are not the scheduling itself; they are preventing double-execution when multiple scheduler nodes run concurrently, recovering jobs that were in-progress when a worker crashed, and maintaining sub-second scheduling precision at 10K executions per second without turning your database into a bottleneck.
Functional Requirements
Core Requirements
- Register jobs with a cron expression or a specific execution time.
- Execute jobs exactly once at (approximately) the scheduled time.
- Retry failed jobs with configurable backoff.
- Provide job status visibility (pending, running, succeeded, failed).
Below the Line
- Workflow DAGs with job dependencies (systems like Airflow or Temporal)
- Real-time streaming jobs
The hardest part in scope: Exactly-once execution at scale across a distributed worker fleet. When multiple scheduler instances poll the same database, each finds the same due rows and publishes duplicate execution records. Solving this cleanly without an external coordinator is the central engineering challenge, and we dedicate a full deep dive to it.
Workflow DAGs are below the line because dependency tracking is a separate layer that sits on top of a scheduler rather than inside it. To add DAG support, I would introduce a Workflow entity holding a directed acyclic graph of job nodes, with a DAG runner that fires each node only after all upstream dependencies complete. That runner submits individual leaf jobs to this scheduler rather than replacing it.
Real-time streaming jobs are below the line because they require continuous, stateful, low-latency processing that is fundamentally different from the discrete "fire once" model here. To add them, I would integrate Apache Flink or Kafka Streams as a separate pipeline alongside this system, not modify the scheduler.
Non-Functional Requirements
Core Requirements
- Scale: 1M active jobs in the system at any time. 10K job executions per second at peak throughput.
- Scheduling precision: Each job fires within 1 second of its scheduled time.
- Availability: 99.99% uptime. If a scheduler node dies, another must take over within seconds, not minutes.
- Delivery semantics: At-least-once delivery is acceptable at the infrastructure layer. Exactly-once is enforced at the application layer via idempotency keys in the handler.
Below the Line
- Sub-100ms scheduling precision
- Geo-distributed scheduling across regions
Sub-100ms scheduling precision is below the line because it forces a fundamentally different architecture: you cannot poll a database fast enough. To achieve it, you would replace the PostgreSQL polling loop with an in-memory timer wheel (like Kafka's or Netty's HashedWheelTimer) running on each scheduler node, pre-loading job fire times into memory and triggering off OS-level timers. That is a specialized real-time system, not a general-purpose scheduler.
Geo-distributed scheduling is below the line because it introduces cross-region clock synchronization, partition-tolerant consensus for job ownership, and the question of whether a job should fire in the region closest to the handler or the region that registered it. To add it, I'd assign each job a "home region" and run independent scheduler fleets per region, with a global control plane for cross-region job migration on failover.
Read/write ratio: Jobs are write-heavy at creation time, then read-heavy as workers poll for ready executions. The interesting pressure is the scheduling fanout: a cron job with a 1-second interval produces 86,400 execution records per day. At 1M active recurring jobs, the system inserts tens of millions of execution records daily while the scheduler continuously scans for due ones. That scan is the hot query, and it determines the index strategy on
job_executions, the caching approach, and why a naive polling loop breaks at scale.
The 1-second precision target constrains the scheduler's polling interval to 500ms or less. A 5-second sleep between polls means a job due at T=1 does not fire until T=5, a 4-second slip that violates the SLA. Polling every 500ms gives enough headroom for query time and network jitter.
The 99.99% availability target means a single scheduler node is not acceptable. That is at most 52 minutes of downtime per year, and a single process restart takes 10-30 seconds. A multi-node scheduler with automatic failover is required.
Core Entities
- Job: The static definition of work to be done. Carries the schedule (cron expression or one-off timestamp), the handler endpoint to invoke, retry policy, and lifecycle status.
- JobExecution: A single invocation of a Job. Tracks which attempt this is, when it ran, which worker claimed it, and whether it succeeded or failed. Each scheduled fire of a Job produces one new JobExecution.
We will revisit schema details, including indexes and partitioning, in the scaling and failure deep dives below. The two entities above are sufficient to drive the API and the HLD.
API Design
FR 1 - Register a job:
POST /jobs
Body: { schedule, handler_endpoint, max_retries, timeout_seconds }
Response: { job_id }
FR 2 - List executions for a job:
GET /jobs/{job_id}/executions?cursor=<execution_id>&limit=50
Response: { executions: [...], next_cursor }
FR 3 - Manually trigger a retry:
POST /jobs/{job_id}/retry
Response: { execution_id }
FR 4 - Get current job status:
GET /jobs/{job_id}
Response: { job_id, status, last_execution_at, next_run_at }
Use POST /jobs because job creation is not idempotent by default: two identical POST requests for the same cron expression create two separate scheduled jobs with distinct job_id values. GET /jobs/{job_id}/executions uses cursor pagination rather than offset pagination because the job_executions table grows unboundedly; OFFSET 100000 requires scanning 100,000 rows before returning anything, which degrades as the table grows. POST /jobs/{job_id}/retry creates a new JobExecution immediately and publishes it to the queue, bypassing the scheduler's polling cycle for fast manual intervention.
High-Level Design
FR 1 and FR 2 - Register and store jobs
The simplest starting point: client registers a job, the App Server validates the cron expression, computes the first next_run_at timestamp, and stores the definition in PostgreSQL.
Request walkthrough:
- Client sends
POST /jobswith a cron expression and handler endpoint. - App Server validates the cron syntax and computes the initial
next_run_at. - App Server inserts a
Jobrow into PostgreSQL withstatus=active. - App Server returns
job_idto the client.
The write path only stores a definition. No execution happens at registration time, and no scheduling machinery is involved yet. The scheduler polls this table in the next step. I'd make this explicit in an interview: "Registration is boring on purpose. All the complexity lives in the execution path."
FR 2 - Execute jobs at the scheduled time (evolve the design)
On a single machine, the right data structure is a min-heap keyed by next_run_at. Pop the minimum entry (the next due job), sleep until that timestamp, execute the job, compute the next scheduled run, push it back. This works well for one process running tens of jobs. I always start here in interviews because it grounds the discussion in something concrete before introducing distributed complexity.
At 10K executions per second, it breaks in three ways. You cannot run jobs inline on the scheduler thread without blocking the next due job. A single scheduler is a single point of failure, and two schedulers for redundancy would fire every job twice.
The core problem: one scheduler cannot know what the other has already enqueued.
The fix is to split responsibilities: the Scheduler only enqueues due jobs to a message queue; Workers pull from the queue and execute. The Scheduler never executes. Workers never decide which jobs are due.
This lets both components scale independently.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.