Job Scheduler
Design a distributed job scheduler that executes millions of cron-like and one-off jobs reliably across a worker fleet, covering scheduling algorithms, exactly-once execution, failure recovery, and priority queuing.
What is a job scheduler?
A job scheduler fires code at a predefined time or on a recurring schedule. Submit a job definition (a cron expression or a one-off timestamp plus a handler endpoint) and the scheduler executes it at the right moment. The interesting engineering problems are not the scheduling itself; they are preventing double-execution when multiple scheduler nodes run concurrently, recovering jobs that were in-progress when a worker crashed, and maintaining sub-second scheduling precision at 10K executions per second without turning your database into a bottleneck.
Functional Requirements
Core Requirements
- Register jobs with a cron expression or a specific execution time.
- Execute jobs exactly once at (approximately) the scheduled time.
- Retry failed jobs with configurable backoff.
- Provide job status visibility (pending, running, succeeded, failed).
Below the Line
- Workflow DAGs with job dependencies (systems like Airflow or Temporal)
- Real-time streaming jobs
The hardest part in scope: Exactly-once execution at scale across a distributed worker fleet. When multiple scheduler instances poll the same database, each finds the same due rows and publishes duplicate execution records. Solving this cleanly without an external coordinator is the central engineering challenge, and we dedicate a full deep dive to it.
Workflow DAGs are below the line because dependency tracking is a separate layer that sits on top of a scheduler rather than inside it. To add DAG support, I would introduce a Workflow entity holding a directed acyclic graph of job nodes, with a DAG runner that fires each node only after all upstream dependencies complete. That runner submits individual leaf jobs to this scheduler rather than replacing it.
Real-time streaming jobs are below the line because they require continuous, stateful, low-latency processing that is fundamentally different from the discrete "fire once" model here. To add them, I would integrate Apache Flink or Kafka Streams as a separate pipeline alongside this system, not modify the scheduler.
Non-Functional Requirements
Core Requirements
- Scale: 1M active jobs in the system at any time. 10K job executions per second at peak throughput.
- Scheduling precision: Each job fires within 1 second of its scheduled time.
- Availability: 99.99% uptime. If a scheduler node dies, another must take over within seconds, not minutes.
- Delivery semantics: At-least-once delivery is acceptable at the infrastructure layer. Exactly-once is enforced at the application layer via idempotency keys in the handler.
Below the Line
- Sub-100ms scheduling precision
- Geo-distributed scheduling across regions
Sub-100ms scheduling precision is below the line because it forces a fundamentally different architecture: you cannot poll a database fast enough. To achieve it, you would replace the PostgreSQL polling loop with an in-memory timer wheel (like Kafka's or Netty's HashedWheelTimer) running on each scheduler node, pre-loading job fire times into memory and triggering off OS-level timers. That is a specialized real-time system, not a general-purpose scheduler.
Geo-distributed scheduling is below the line because it introduces cross-region clock synchronization, partition-tolerant consensus for job ownership, and the question of whether a job should fire in the region closest to the handler or the region that registered it. To add it, I'd assign each job a "home region" and run independent scheduler fleets per region, with a global control plane for cross-region job migration on failover.
Read/write ratio: Jobs are write-heavy at creation time, then read-heavy as workers poll for ready executions. The interesting pressure is the scheduling fanout: a cron job with a 1-second interval produces 86,400 execution records per day. At 1M active recurring jobs, the system inserts tens of millions of execution records daily while the scheduler continuously scans for due ones. That scan is the hot query, and it determines the index strategy on job_executions, the caching approach, and why a naive polling loop breaks at scale.
The 1-second precision target constrains the scheduler's polling interval to 500ms or less. A 5-second sleep between polls means a job due at T=1 does not fire until T=5, a 4-second slip that violates the SLA. Polling every 500ms gives enough headroom for query time and network jitter.
The 99.99% availability target means a single scheduler node is not acceptable. That is at most 52 minutes of downtime per year, and a single process restart takes 10-30 seconds. A multi-node scheduler with automatic failover is required.
Core Entities
- Job: The static definition of work to be done. Carries the schedule (cron expression or one-off timestamp), the handler endpoint to invoke, retry policy, and lifecycle status.
- JobExecution: A single invocation of a Job. Tracks which attempt this is, when it ran, which worker claimed it, and whether it succeeded or failed. Each scheduled fire of a Job produces one new JobExecution.
We will revisit schema details, including indexes and partitioning, in the scaling and failure deep dives below. The two entities above are sufficient to drive the API and the HLD.
API Design
FR 1 - Register a job:
POST /jobs
Body: { schedule, handler_endpoint, max_retries, timeout_seconds }
Response: { job_id }
FR 4 - List executions for a job:
GET /jobs/{job_id}/executions?cursor=<execution_id>&limit=50
Response: { executions: [...], next_cursor }
FR 3 - Manually trigger a retry:
POST /jobs/{job_id}/retry
Response: { execution_id }
FR 4 - Get current job status:
GET /jobs/{job_id}
Response: { job_id, status, last_execution_at, next_run_at }
Use POST /jobs because job creation is not idempotent by default: two identical POST requests for the same cron expression create two separate scheduled jobs with distinct job_id values. GET /jobs/{job_id}/executions uses cursor pagination rather than offset pagination because the job_executions table grows unboundedly; OFFSET 100000 requires scanning 100,000 rows before returning anything, which degrades as the table grows. POST /jobs/{job_id}/retry creates a new JobExecution immediately and publishes it to the queue, bypassing the scheduler's polling cycle for fast manual intervention.
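The cursor-pagination argument above can be made concrete with a minimal sketch. This uses SQLite as a stand-in for PostgreSQL, and the table shape and integer cursor are illustrative assumptions (the real table would use UUID v7 or another sortable ID):

```python
import sqlite3

# Hypothetical minimal job_executions table; names and shape are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_executions (execution_id INTEGER PRIMARY KEY, job_id TEXT, status TEXT)")
db.executemany(
    "INSERT INTO job_executions (execution_id, job_id, status) VALUES (?, ?, ?)",
    [(i, "job-1", "succeeded") for i in range(1, 101)],
)

def list_executions(job_id, cursor=None, limit=5):
    """Cursor pagination: seek past the last-seen execution_id instead of using OFFSET."""
    rows = db.execute(
        "SELECT execution_id FROM job_executions "
        "WHERE job_id = ? AND execution_id > ? "
        "ORDER BY execution_id LIMIT ?",
        (job_id, cursor or 0, limit),
    ).fetchall()
    ids = [r[0] for r in rows]
    next_cursor = ids[-1] if len(ids) == limit else None
    return ids, next_cursor

page1, c1 = list_executions("job-1")             # first page: ids 1..5, cursor 5
page2, c2 = list_executions("job-1", cursor=c1)  # next page: ids 6..10
```

The seek condition `execution_id > ?` hits the index directly, so query cost is independent of how deep into the table the client has paged.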
High-Level Design
FR 1 - Register and store jobs
The simplest starting point: client registers a job, the App Server validates the cron expression, computes the first next_run_at timestamp, and stores the definition in PostgreSQL.
Request walkthrough:
- Client sends POST /jobs with a cron expression and handler endpoint.
- App Server validates the cron syntax and computes the initial next_run_at.
- App Server inserts a Job row into PostgreSQL with status=active.
- App Server returns job_id to the client.
The write path only stores a definition. No execution happens at registration time, and no scheduling machinery is involved yet. The scheduler polls this table in the next step. I'd make this explicit in an interview: "Registration is boring on purpose. All the complexity lives in the execution path."
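The registration-time computation of next_run_at can be sketched as follows. This assumes a deliberately simplified schedule format ("@at:" for one-off jobs, "@every:" for fixed intervals); a production system would run a full cron-expression parser here instead:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the App Server's registration-time computation. The "@at:"/"@every:"
# schedule syntax is an illustrative assumption, not a real cron dialect.
def compute_first_run(schedule: str, now: datetime) -> datetime:
    if schedule.startswith("@at:"):
        # One-off job: fire exactly once at the given timestamp.
        return datetime.fromisoformat(schedule[4:])
    if schedule.startswith("@every:"):
        # Recurring job: first fire is one interval from now.
        return now + timedelta(seconds=int(schedule[7:]))
    raise ValueError(f"unsupported schedule: {schedule}")

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first = compute_first_run("@every:300", now)  # next_run_at = 12:05:00 UTC
```

Whatever the schedule syntax, the key point stands: registration writes a row with a concrete next_run_at and does nothing else.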
FR 2 - Execute jobs at the scheduled time (evolve the design)
On a single machine, the right data structure is a min-heap keyed by next_run_at. Pop the minimum entry (the next due job), sleep until that timestamp, execute the job, compute the next scheduled run, push it back. This works well for one process running tens of jobs. I always start here in interviews because it grounds the discussion in something concrete before introducing distributed complexity.
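The single-machine loop can be sketched in a few lines with heapq. Time is simulated as an integer tick here so the loop is deterministic; a real loop would sleep until the head entry's timestamp:

```python
import heapq

# Single-machine scheduler sketch: a min-heap keyed by next_run_at.
# Heap entries are (run_at, job_id, interval); interval=None means one-off.
def run_until(heap, stop_at):
    fired = []
    while heap and heap[0][0] <= stop_at:
        run_at, job_id, interval = heapq.heappop(heap)  # pop the next due job
        fired.append((run_at, job_id))                   # "execute" the job
        if interval:                                     # recurring: push the next run back
            heapq.heappush(heap, (run_at + interval, job_id, interval))
    return fired

heap = []
heapq.heappush(heap, (5, "one-off", None))
heapq.heappush(heap, (2, "every-3s", 3))
fired = run_until(heap, stop_at=8)  # every-3s fires at t=2, 5, 8; one-off at t=5
```

Pop, execute, recompute, push back: this is the loop that the rest of the design distributes.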
At 10K executions per second, this breaks in three ways: jobs cannot run inline on the scheduler thread without blocking the next due job; a single scheduler is a single point of failure; and adding a second scheduler for redundancy fires every job twice.
The core problem: one scheduler cannot know what the other has already enqueued.
The fix is to split responsibilities: the Scheduler only enqueues due jobs to a message queue; Workers pull from the queue and execute. The Scheduler never executes. Workers never decide which jobs are due.
This lets both components scale independently.
What happens when multiple jobs are due at the same millisecond? The scheduler issues a single SELECT WHERE next_run_at <= NOW(). That batch can return thousands of rows. It publishes all of them to the queue in one loop pass. Workers pull concurrently, so the burst fans out across the worker pool instead of serializing on a single thread.
What happens when an urgent job arrives and the scheduler is sleeping? If the scheduler polls every 500ms, the worst-case delay for a newly registered one-off job is 500ms. For jobs that must fire within 1 second of registration, the App Server can publish the execution directly to the queue at registration time, bypassing the polling cycle entirely.
Separating the Scheduler (which only enqueues) from the Worker Pool (which only executes) is the core architectural decision. It eliminates the inline-execution bottleneck and lets each tier scale independently.
Double-execution gotcha: If two Scheduler instances run concurrently and both query for due jobs, they will both find the same rows and publish duplicate execution records. The queue delivers both; two workers execute the same job. This is the central problem in distributed scheduling. We solve it with database-level row claims in Deep Dive 1.
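The row-claim fix from Deep Dive 1 can be demonstrated end to end. This sketch uses SQLite in place of PostgreSQL, with an illustrative table shape; the pattern is the conditional UPDATE, where the WHERE clause guarantees exactly one scheduler wins the race:

```python
import sqlite3

# Two scheduler instances race to claim the same due execution.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job_executions (
    id INTEGER PRIMARY KEY, status TEXT, locked_by TEXT)""")
db.execute("INSERT INTO job_executions VALUES (1, 'pending', NULL)")

def try_claim(scheduler_id: str, execution_id: int) -> bool:
    """Atomically claim the row; returns True only for the first caller."""
    cur = db.execute(
        "UPDATE job_executions SET locked_by = ? "
        "WHERE id = ? AND locked_by IS NULL",
        (scheduler_id, execution_id),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows updated means someone else already claimed it

won_a = try_claim("scheduler-a", 1)  # True: the row was unclaimed
won_b = try_claim("scheduler-b", 1)  # False: locked_by is no longer NULL
```

A production version would also set locked_at and extend the WHERE clause to allow re-claiming rows whose lock is older than a timeout, so claims from crashed schedulers expire rather than blocking the job forever.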
FR 3 - Retry failed jobs
Workers check attempt_number against max_retries on each failure. If under the limit, compute an exponential backoff delay (2^attempt * base_delay), update the execution status to retrying with next_run_at = NOW() + backoff, and let the Scheduler pick it up on the next poll. The Scheduler treats retrying executions identically to new ones: any row with next_run_at <= NOW() gets published.
On failure, the Worker writes the retry state to the database atomically before anything else. If the Worker crashes between execution and the DB write, the execution stays in running status with a stale started_at. A watchdog process reclaims stale-running executions (covered in Deep Dive 2).
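The backoff computation described above is a one-liner. The cap and the optional jitter are my additions, not part of the spec, but both are standard practice:

```python
import random

# Exponential backoff as described: delay = 2^attempt * base_delay.
# The cap and jitter are assumptions layered on top of the original formula.
def backoff_seconds(attempt: int, base_delay: float = 1.0,
                    cap: float = 300.0, jitter: bool = False) -> float:
    delay = min((2 ** attempt) * base_delay, cap)
    if jitter:
        # Full jitter spreads retries out and avoids thundering herds.
        delay = random.uniform(0, delay)
    return delay

delays = [backoff_seconds(a) for a in range(1, 5)]  # [2.0, 4.0, 8.0, 16.0]
```

The Worker writes next_run_at = NOW() + backoff_seconds(attempt) in the same transaction as the status update, and the Scheduler's normal polling loop does the rest.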
FR 4 - Job status visibility
Workers update job_executions on three events: start (status=running, started_at=NOW()), success, and failure. The status API performs an indexed scan by job_id. Add a compound index on (job_id, scheduled_at DESC) to make GET /jobs/{job_id}/executions efficient for cursor pagination, and a partial index on status for the dashboard query that shows all currently-running executions.
Status visibility is a read concern addressed by indexes, not by new components. The job_executions table already captures every state transition; the right compound index is the only addition.
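The two indexes named above look like this. The DDL is shown in SQLite syntax for illustration (Postgres syntax is near-identical), with illustrative table and index names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job_executions (
    execution_id INTEGER PRIMARY KEY, job_id TEXT,
    scheduled_at TEXT, status TEXT)""")

# Compound index serving the cursor-paginated per-job executions listing.
db.execute("CREATE INDEX idx_exec_job ON job_executions (job_id, scheduled_at DESC)")

# Partial index serving the "currently running" dashboard query: it only
# contains rows with status='running', so it stays tiny as the table grows.
db.execute("CREATE INDEX idx_exec_running ON job_executions (status) WHERE status = 'running'")

names = [row[1] for row in db.execute("PRAGMA index_list('job_executions')")]
```

The partial index is the important trick: indexing all of status would be mostly dead weight, since nearly every row is a terminal succeeded or failed.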
Potential Deep Dives
Deep Dive 1: How do we prevent double-execution when the scheduler is distributed?
This is the hardest problem in the system. Multiple scheduler instances polling the same database find the same due rows and publish duplicate execution records, causing jobs to run twice. I'd frame this for the interviewer immediately: "The core challenge is not scheduling itself, it is preventing two nodes from doing the same work."
Deep Dive 2: How do we handle job execution failures and retries reliably?
The retry logic on Workers is straightforward, but what happens when a Worker crashes between executing the job and writing the failure status to the database? The execution stays in running state forever.
Deep Dive 3: How do we scale to 10K job executions per second?
At 10K executions per second, the scheduler's polling loop becomes the bottleneck. That is 10K UPDATE writes per second just for status transitions, plus SELECT scans, plus Worker DB updates. A naive polling model cannot absorb this without overwhelming PostgreSQL. I'd bring up throughput scaling only after covering correctness (double-execution prevention) and durability (retry handling), because those are more important to get right first.
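The Redis sorted-set pattern alluded to here (ZADD with score = execute_at on the Coordinator side, an atomic ZRANGEBYSCORE + ZREM claim on the Worker side) can be modeled in memory. This stand-in uses a lock where Redis would use a Lua script for atomicity; names and shape are illustrative:

```python
import threading

# In-memory model of the sorted-set due queue. In Redis, add() is ZADD and
# claim_due() is a Lua script doing ZRANGEBYSCORE + ZREM in one round-trip.
class DueQueue:
    def __init__(self):
        self._scores = {}             # execution_id -> execute_at (the ZSET)
        self._lock = threading.Lock()

    def add(self, execution_id: str, execute_at: float):
        self._scores[execution_id] = execute_at            # ZADD

    def claim_due(self, now: float):
        """Atomically pop the earliest due execution, or None if nothing is due."""
        with self._lock:                                   # stands in for Lua atomicity
            due = [(t, e) for e, t in self._scores.items() if t <= now]
            if not due:
                return None
            _, execution_id = min(due)                     # ZRANGEBYSCORE ... LIMIT 1
            del self._scores[execution_id]                 # ZREM
            return execution_id

q = DueQueue()
q.add("exec-1", 100.0)
q.add("exec-2", 105.0)
first = q.claim_due(now=101.0)   # "exec-1" is due, "exec-2" is not yet
second = q.claim_due(now=101.0)  # None: nothing else due, no double-claim
```

Because the claim removes the member in the same atomic step that reads it, two workers can never claim the same execution, which is exactly the property the Lua script buys in real Redis.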
Final Architecture
The Scheduler's only job is to decide which executions are due and move them into Redis. It never executes a job. Workers never decide what is due; they only consume from Redis and write results to PostgreSQL. This separation is what lets each tier fail and recover independently without data loss.
Interview Cheat Sheet
- Start single-machine. On one machine, keep a min-heap keyed by next_run_at. Pop the minimum, sleep until that time, execute the job, push the next run back. This makes the data structure concrete for the interviewer before you scale it.
- The heap becomes a database index at scale. The PostgreSQL job_executions table with an index on next_run_at is the distributed equivalent of a min-heap. The Redis sorted set (ZADD/ZPOPMIN with score = execute_at) is the high-throughput sub-millisecond equivalent.
- Multiple jobs due at the same millisecond. The scheduler issues one SELECT WHERE next_run_at <= NOW() and gets all of them as a batch. Workers drain the queue concurrently, so a burst of 10K simultaneous jobs fans out across the worker pool instead of serializing.
- Urgent job when the scheduler is sleeping. The App Server can publish the execution directly to the queue at registration time for one-off immediate jobs, bypassing the 500ms polling window entirely.
- Double-execution prevention: have three options ready. (1) Single scheduler (SPOF, do not use). (2) Leader election via ZooKeeper (10-30s failover gap). (3) DB row-level claims via UPDATE SET locked_by=me WHERE locked_by IS NULL (no external coordinator, failure window bounded by locked_at timeout).
- The locked_at < NOW() - 30s clause is the scheduler's crash recovery. If a scheduler instance claimed rows but crashed before publishing them to the queue, those rows unlock after 30 seconds and another instance claims and publishes them.
- Crash recovery for in-progress workers. Any execution with status=running and started_at < NOW() - timeout_seconds is stale. A watchdog queries for these every 60 seconds and force-fails them, then applies transactional retry logic.
- Exactly-once at scale requires two guarantees. At-least-once delivery from the queue plus idempotent handlers. Pass execution_id to the handler as an idempotency key. The infrastructure gives at-least-once; the handler gives exactly-once semantics end-to-end.
- The transactional retry pattern eliminates silent job loss. Atomically updating the current execution to failed and inserting the next retry record in one DB transaction means no state is lost if the Worker crashes post-commit. The Scheduler picks up the new pending record within 500ms.
- Redis sorted set scales to 10K executions per second. The Coordinator pre-loads due jobs 60 seconds ahead via ZADD score=execute_at. Workers claim atomically via Lua (ZRANGEBYSCORE + ZREM in one round-trip). Workers never read from PostgreSQL for scheduling decisions, reducing DB read load by roughly 100x.
- Poll interval vs. precision tradeoff. Polling every 500ms gives 500ms worst-case miss latency and meets the 1-second SLA. Polling every 100ms reduces the worst case to 100ms but increases PostgreSQL query load 5x. Use 500ms as the default; tune down only if precision requirements tighten.
- Use cursor pagination on job_executions. The table grows unboundedly at high job frequency. Offset-based pagination degrades as the table grows. Cursor on execution_id (UUID v7 or a sequential ID) is stable and fast regardless of table size.
- At-least-once from the queue means Workers must be idempotent. If a Worker crashes after executing the handler but before acknowledging the queue message, the queue redelivers. Use execution_id as an idempotency key in all downstream writes to make re-execution safe.