Task Coordinator
Design a system that distributes and coordinates large-scale computation across thousands of machines, covering task decomposition, work assignment, fault tolerance, stragglers, and result aggregation: the core of MapReduce-style pipelines.
What is a distributed task coordinator?
A distributed task coordinator accepts large jobs from clients, partitions each job into smaller tasks, and fans them out to a worker pool running across many machines. The completed results are aggregated and returned to the client. The central challenge is the coordination machinery itself: task assignment without coordinator bottlenecks, worker crash recovery without losing progress, and result aggregation without fan-in collapse at scale.
I always open this question by drawing the coordinator as a single box and then asking: "what happens when this box is slower than the workers?" That one question drives the entire design.
Functional Requirements
Core Requirements
- A client can submit a job specifying an input dataset and a processing function, and receives a job ID in return.
- The system partitions the job into tasks, assigns tasks to available workers, and tracks completion.
- A worker failure is automatically detected and its task is re-assigned to another worker without client involvement.
- When all tasks complete, the system aggregates the results and makes the final output available to the client.
Below the Line (out of scope)
- Priority scheduling across multiple concurrent jobs
- Real-time streaming (as opposed to batch) computation
- Security isolation between tenants sharing the cluster
- Billing and resource quotas per job
The hardest part in scope: Coordinating task assignment at 50,000 tasks per second across 10,000 workers without the coordinator becoming a bottleneck, while detecting and recovering from up to 10% worker failures mid-job without losing progress.
Priority scheduling is below the line because it adds a fair-share queue and preemption machinery orthogonal to the core coordination design. To add it, maintain a separate priority queue of pending jobs and implement a weighted-fair-share scheduler that dequeues jobs by priority weight.
Real-time streaming is out of scope because stream processing requires continuous ingestion and windowed computation, whereas this design targets finite batch jobs with a defined completion. To extend it, replace the task queue with a Kafka consumer group and add windowed aggregation on the worker side.
Non-Functional Requirements
Core Requirements
- Scale: Support jobs with up to 1 million tasks. Up to 10,000 concurrent workers.
- Fault tolerance: Complete a job even if up to 10% of workers fail during execution.
- Coordinator availability: 99.99% uptime. A coordinator failure must not lose in-progress jobs.
- Assignment throughput: Coordinator assigns tasks at minimum 50,000 tasks per second to avoid being the bottleneck.
- Task execution time: Individual task execution time ranges from 100ms to 30 minutes.
- Read/write ratio on the task store: Writes dominate (status updates from workers heartbeating and completing). Batch reads occur when the coordinator scans for expired leases.
Below the Line
- Sub-second coordinator failover (15 seconds is acceptable for this design)
- Cross-region replication of in-progress job state
- Per-task compute isolation (all workers share the same cluster)
Core Entities
- Job: A unit of work submitted by a client. Contains job_id, input_path (pointer to input data in S3), function_id, status (queued / running / completed / failed), and submitted_at.
- Task: A subdivision of a job. Contains task_id, job_id, partition_spec (key range or file chunk), status (pending / in_progress / completed / failed), assigned_worker_id, and lease_expiry_ts.
- Worker: A machine that executes tasks. Registers its worker_id and capacity, heartbeats while alive, and reports task outcomes to the coordinator.
- Result: The output produced by a worker for a task. Contains task_id, output_path (S3 location of partial result), and whether it is a partial or final result.
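The entities above can be sketched as plain data classes. Field names follow the lists above; the concrete types and defaults are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    job_id: str
    input_path: str           # pointer to input data in S3
    function_id: str
    status: str = "queued"    # queued / running / completed / failed
    submitted_at: float = 0.0

@dataclass
class Task:
    task_id: str
    job_id: str
    partition_spec: dict      # key range or file chunk
    status: str = "pending"   # pending / in_progress / completed / failed
    assigned_worker_id: Optional[str] = None
    lease_expiry_ts: Optional[float] = None

@dataclass
class Worker:
    worker_id: str
    capacity: int
    last_heartbeat_ts: float = 0.0

@dataclass
class Result:
    task_id: str
    output_path: str          # S3 location of partial result
    is_partial: bool = True
```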
API Design
Two surfaces exist: the client API (job submission and status) and the worker API (task pull, heartbeat, completion reporting).
// Client: submit a job for distributed execution
POST /jobs
Content-Type: application/json
{
"input_path": "s3://jobs-input/dataset-2024-01.csv",
"function_id": "word_count_v2",
"output_path": "s3://jobs-output/run-abc123/"
}
Response: 201 Created | { "job_id": "job-abc123", "status": "queued" }
// Client: poll job status and retrieve output location
GET /jobs/{job_id}
Response: 200 OK
{
"job_id": "job-abc123",
"status": "completed",
"tasks_total": 1000,
"tasks_completed": 1000,
"output_path": "s3://jobs-output/run-abc123/final"
}
// Worker: pull the next available task
POST /tasks/claim
Authorization: Bearer <worker-token>
Response: 200 OK
{
"task_id": "task-77",
"partition_spec": { "start_byte": 0, "end_byte": 104857600 },
"input_path": "s3://jobs-input/dataset-2024-01.csv",
"lease_ttl_seconds": 30
}
// Worker: extend task lease with a heartbeat
POST /tasks/{task_id}/heartbeat
Authorization: Bearer <worker-token>
Response: 200 OK | 410 Gone (task was reclaimed by coordinator)
// Worker: report task completion
POST /tasks/{task_id}/complete
Authorization: Bearer <worker-token>
{ "output_path": "s3://jobs-output/run-abc123/task-77.part" }
The 410 on heartbeat is the signal to a worker that its lease was reclaimed. The worker must stop executing and discard its partial output. The lease_ttl_seconds in the claim response tells the worker how often to heartbeat: at intervals of at most TTL / 3, so the coordinator tolerates two missed beats before the lease expires.
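The lease bookkeeping implied by these endpoints reduces to two small functions. A minimal sketch (function names are illustrative, not part of the API):

```python
import time
from typing import Optional

def heartbeat_interval(lease_ttl_seconds: float) -> float:
    """Heartbeat at intervals of at most TTL / 3, so up to two beats
    can be lost before the lease lapses."""
    return lease_ttl_seconds / 3

def lease_expired(lease_expiry_ts: float, now: Optional[float] = None) -> bool:
    """Coordinator-side check: a task whose expiry is in the past is
    eligible for reclamation (the worker then sees 410 Gone)."""
    if now is None:
        now = time.time()
    return lease_expiry_ts < now
```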
High-Level Design
1. Naive approach: push-based coordinator
The simplest design is a single coordinator that tracks available workers and pushes tasks to them one at a time. The coordinator holds the task queue in memory and drives completion by monitoring worker availability.
Request walkthrough (submit and execute):
- Client sends POST /jobs with an input path and function ID.
- Coordinator reads the input metadata, splits it into N tasks, and stores them as pending in the Task Store.
- Coordinator finds idle workers and pushes one task to each.
- Worker executes, completes, and notifies the coordinator via RPC.
- Coordinator marks the task completed and pushes the next pending task to the now-idle worker.
This design is correct for small clusters. The coordinator becomes the bottleneck once the worker count grows. I've seen teams build this exact push loop first without realizing the coordinator serializes every assignment. It works beautifully at 50 workers and falls apart at 5,000.
The coordinator bottleneck problem:
At 10,000 workers completing sub-second tasks, the coordinator must process 10,000 assignments per second through a single loop on a single thread. Network call latency stacks up and the assignment loop serializes what should be parallel decisions. If the coordinator is slow for 10ms, all workers stall simultaneously.
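The bottleneck is easy to quantify. Assuming each push assignment costs roughly 1 ms of serialized work (RPC plus bookkeeping; the figure is illustrative, not measured):

```python
# Back-of-envelope: a single-threaded push loop that spends 1 ms of
# serialized work per assignment caps out at 1,000 assignments/s,
# far below the 50,000/s requirement.
per_assignment_ms = 1.0   # assumed cost of one push (RPC + bookkeeping)
max_assignments_per_sec = 1000 / per_assignment_ms
print(max_assignments_per_sec)
```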
2. Evolved approach: pull-based workers with a distributed task queue
The fix inverts control. Workers pull tasks from a shared queue rather than waiting for the coordinator to push. The coordinator no longer makes per-task assignment decisions: each worker claims its own next task independently, which distributes the assignment decision across all workers.
New components:
- Partition Service: decomposes a submitted job into N task specs and publishes them to the Task Queue.
- Task Queue: a Kafka topic holding pending tasks. Workers pull from it using consumer group semantics.
- Progress Store (Redis): tracks per-task status and lease expiry so the coordinator can report job progress without scanning the full task table.
- Lease Monitor: a background process inside the coordinator that scans for expired leases and re-enqueues stale tasks.
Request walkthrough (evolved):
- Client sends
POST /jobs. - Partition Service splits the input into 1,000 tasks and publishes them to the Task Queue.
- Workers (all 10,000) are polling the queue. Each pulls one task and marks it
in_progressin the Progress Store with a 30-second lease. - Worker executes, writes a partial result to S3, and calls
POST /tasks/{id}/complete. - Coordinator polls the Progress Store; when all tasks are
completed, it triggers the Combiner Fleet.
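The pull flow above can be simulated in a few lines, with an in-memory queue standing in for the Kafka topic and a dict standing in for the Redis Progress Store (both are stand-ins, not the real services):

```python
import queue
import time

task_queue = queue.Queue()   # stands in for the Kafka task topic
progress = {}                # stands in for the Redis Progress Store

for i in range(4):           # Partition Service publishes task specs
    task_queue.put(f"task-{i}")

def claim(worker_id: str, lease_ttl: float = 30.0) -> str:
    """Worker-side pull: claim the next task and record a lease."""
    task_id = task_queue.get_nowait()
    progress[task_id] = {
        "status": "in_progress",
        "worker": worker_id,
        "lease_expiry_ts": time.time() + lease_ttl,
    }
    return task_id

def complete(task_id: str, output_path: str) -> None:
    """Worker reports completion; coordinator never made the assignment."""
    progress[task_id].update(status="completed", output=output_path)

t = claim("worker-1")
complete(t, f"s3://jobs-output/{t}.part")
```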
3. Fault tolerance: heartbeat and lease TTL
A worker that crashes mid-task never reports completion. Its task stays in_progress forever unless the coordinator detects the crash. The lease TTL resolves this: every task has a deadline by which the worker must heartbeat, and the Lease Monitor re-queues any task past its deadline.
When a worker heartbeats, the coordinator extends lease_expiry_ts by TTL seconds in the Progress Store. The Lease Monitor scans for records where lease_expiry_ts < now() and transitions them to pending. Tasks must be idempotent: executing the same task twice must produce the same output without corrupting shared state, since lease expiry guarantees re-execution.
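The Lease Monitor's scan reduces to one pass over the Progress Store. A sketch using a plain dict and a requeue callback (in production this would be a Redis scan feeding a Kafka producer):

```python
import time

def reclaim_expired(progress: dict, requeue) -> list:
    """Transition tasks whose lease has lapsed back to pending and
    hand their IDs to `requeue` for re-enqueueing."""
    now = time.time()
    reclaimed = []
    for task_id, rec in progress.items():
        if rec["status"] == "in_progress" and rec["lease_expiry_ts"] < now:
            rec["status"] = "pending"
            requeue(task_id)
            reclaimed.append(task_id)
    return reclaimed

store = {
    "task-1": {"status": "in_progress", "lease_expiry_ts": time.time() - 5},
    "task-2": {"status": "in_progress", "lease_expiry_ts": time.time() + 60},
}
requeued = []
reclaim_expired(store, requeued.append)   # task-1's lease has lapsed
```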
I'd highlight the idempotency requirement early in an interview because it constrains everything downstream. If your tasks have side effects (writing to a shared counter, sending an email), lease-based re-execution will produce duplicates. The interviewer wants to hear you call this out proactively.
4. Result aggregation: two-phase fan-in
Once all tasks complete, the coordinator has 1,000 partial result files in S3. Routing all of them to a single combiner creates a fan-in bottleneck: the combiner would download 1,000 files sequentially. Two-phase aggregation solves this: workers pre-reduce their output locally before flushing to S3, a combiner fleet merges N/M partials in parallel, and a final merge step produces one output file.
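A toy version of the two phases, using word counts as the per-task output (the grouping factor and the merge function are illustrative choices, not part of the design's API):

```python
from collections import Counter

def merge(partials):
    """Combine a list of per-task Counters into one."""
    out = Counter()
    for p in partials:
        out.update(p)
    return out

def two_phase_aggregate(partials, group_size):
    # Phase 1: the combiner fleet merges fixed-size groups in parallel.
    combined = [merge(partials[i:i + group_size])
                for i in range(0, len(partials), group_size)]
    # Phase 2: a single final merge over the M combiner outputs.
    return merge(combined)

parts = [Counter({"a": 1}), Counter({"a": 2, "b": 1}), Counter({"b": 3})]
total = two_phase_aggregate(parts, group_size=2)
# total == Counter({'b': 4, 'a': 3})
```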
I always draw the single-combiner bottleneck first and let the interviewer feel the problem before introducing the two-phase fix. It shows you understand why the naive approach breaks, not just that you memorized the solution.
Potential Deep Dives
1. How do we partition a large job into balanced tasks?
Well-balanced partitions matter because the job completes only when the last task finishes. A single 10x oversized partition creates a straggler that blocks the entire aggregation phase.
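Sample-based range partitioning (the approach the cheat sheet recommends) can be sketched over sortable keys. The sample below and the quantile-based boundary placement are illustrative:

```python
def partition_boundaries(sample_keys, num_partitions):
    """Place partition boundaries at even quantiles of a sorted sample,
    so each key range receives roughly the same number of records."""
    keys = sorted(sample_keys)
    step = len(keys) / num_partitions
    return [keys[int(i * step)] for i in range(1, num_partitions)]

# A small sample still reflects the key distribution well enough
# to cut the input into roughly equal ranges:
sample = list(range(1000))
bounds = partition_boundaries(sample, 4)
# bounds == [250, 500, 750]
```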
2. How do we assign work to 10,000 workers without a coordinator bottleneck?
The coordinator throughput requirement is 50,000 task assignments per second. A push-based coordinator serializes all assignment decisions through one process. Workers pulling from a distributed queue scale assignment throughput horizontally, since each worker makes its own claim decision independently.
3. How do we handle worker crashes and straggler workers?
A crashed worker and a slow worker are both problems. Crashes are detected by missed heartbeats. Stragglers are detected by watching completion rate as a job nears the end.
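Straggler detection can be expressed as a single predicate. The 95% trigger and 2x-median threshold follow the cheat sheet; the timings below are illustrative:

```python
import statistics

def speculative_candidates(completed_durations, running, now,
                           total_tasks, trigger=0.95, factor=2.0):
    """Once `trigger` of tasks are done, flag running tasks whose
    elapsed time exceeds `factor` x the median completed duration."""
    if len(completed_durations) < trigger * total_tasks:
        return []
    median = statistics.median(completed_durations)
    return [task_id for task_id, started_at in running.items()
            if now - started_at > factor * median]

durations = [10.0] * 96            # 96 of 100 tasks finished in 10 s
running = {"task-77": 0.0}         # one straggler, started at t=0
slow = speculative_candidates(durations, running, now=25.0,
                              total_tasks=100)
# slow == ["task-77"]: 25 s elapsed > 2 x 10 s median
```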
4. How do we make the coordinator highly available?
The coordinator orchestrates every job transition: partition, queue, monitor, aggregate. A coordinator crash mid-job must not lose task state or stall the worker pool indefinitely.
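The standby's recovery path can be sketched as folding an append-only log of state transitions into a task table; the log format here is an assumption, not the actual etcd schema:

```python
def replay(log):
    """Rebuild the coordinator's task table from an append-only log of
    (task_id, new_status) transitions, keeping the last write per task."""
    tasks = {}
    for task_id, status in log:
        tasks[task_id] = status
    return tasks

# The standby continuously applies the leader's log, so on failover
# it already holds state within milliseconds of the last write:
log = [("t1", "pending"), ("t1", "in_progress"),
       ("t2", "pending"), ("t1", "completed")]
state = replay(log)
# state == {"t1": "completed", "t2": "pending"}
```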
Final Architecture
The API Gateway routes client requests to the Job API, which creates the job record and notifies the Coordinator Service. The Coordinator holds the leader election lease in etcd, triggers the Partition Service, and drives all job state transitions including launching the Combiner Fleet when all tasks complete. Workers pull tasks from Kafka, heartbeat through the Progress Store (Redis), and write partial results to S3; the Lease Monitor re-queues timed-out tasks.
When all tasks complete, the Combiner Fleet merges N/M partial files in parallel before a final pass produces one output file. Coordinator HA comes from etcd leader election with a hot standby continuously replaying the task log, enabling failover in roughly 10 seconds after a leader crash.
Interview Cheat Sheet
- Partition the input into roughly equal tasks using sample-based range partitioning: sample 1% of the input to estimate data distribution, then place boundaries at even record intervals aligned on record separators.
- Use a pull model over a push model: workers claim tasks from a shared Kafka queue rather than waiting for the coordinator to assign them, which decouples worker throughput from coordinator throughput entirely.
- Work-stealing through Kafka partitions achieves near-linear horizontal scaling: workers pull from their assigned partition and steal batches from idle sibling partitions, rebalancing load without coordinator involvement.
- Every task must carry a lease with a TTL; the Lease Monitor re-enqueues any task that misses its heartbeat deadline, recovering from worker crashes within TTL + scan_interval seconds.
- Tasks must be idempotent: executing the same task twice must produce the same output without corrupting shared state, because lease expiry always causes re-execution and speculative clones may run in parallel.
- Speculative execution eliminates straggler blocking: once 95% of tasks complete, re-schedule any task running longer than 2x median on an idle worker and take the first result; discard the duplicate.
- Two-phase aggregation removes the fan-in bottleneck: workers write pre-reduced partial results to S3, a combiner fleet merges N/M partials in parallel, and a final merge step combines M combiner outputs into one result.
- Make the coordinator highly available via etcd leader election: the standby continuously replays the task log (WAL) and can take over coordination in roughly 10 seconds after a leader failure (5s TTL expiry plus election plus WAL replay).
- Write all coordinator state transitions to a durable append-only log in etcd so the standby has a complete replay of events up to within milliseconds of the leader's last write.
- Return 410 Gone on a heartbeat when a task lease has been reclaimed; this signals the worker to stop executing immediately and discard its in-progress partial output.
- Tune the lease TTL to 2-3x the worker heartbeat interval (not the task duration; heartbeats keep long tasks alive): too short causes unnecessary re-execution on slow networks; too long delays crash detection and holds up job completion.
- For the design to scale to 1 million tasks, the coordinator must never hold per-task state in memory; store all task state in the Progress Store (Redis) and let the coordinator only orchestrate transitions.