Task Coordinator
Design a system that distributes and coordinates large-scale computation across thousands of machines, covering task decomposition, work assignment, fault tolerance, stragglers, and result aggregation: the core of MapReduce-style pipelines.
What is a distributed task coordinator?
A distributed task coordinator accepts large jobs from clients, partitions each job into smaller tasks, and fans them out to a worker pool running across many machines. The completed results are aggregated and returned to the client. The central challenge is the coordination machinery itself: task assignment without coordinator bottlenecks, worker crash recovery without losing progress, and result aggregation without fan-in collapse at scale.
I always open this question by drawing the coordinator as a single box and then asking: "what happens when this box is slower than the workers?" That one question drives the entire design.
Functional Requirements
Core Requirements
- A client can submit a job specifying an input dataset and a processing function, and receives a job ID in return.
- The system partitions the job into tasks, assigns tasks to available workers, and tracks completion.
- A worker failure is automatically detected and its task is re-assigned to another worker without client involvement.
- When all tasks complete, the system aggregates the results and makes the final output available to the client.
Below the Line (out of scope)
- Priority scheduling across multiple concurrent jobs
- Real-time streaming (as opposed to batch) computation
- Security isolation between tenants sharing the cluster
- Billing and resource quotas per job
The hardest part in scope: Coordinating task assignment at 50,000 tasks per second across 10,000 workers without the coordinator becoming a bottleneck, while detecting and recovering from up to 10% worker failures mid-job without losing progress.
Priority scheduling is below the line because it adds a fair-share queue and preemption machinery orthogonal to the core coordination design. To add it, maintain a separate priority queue of pending jobs and implement a weighted-fair-share scheduler that dequeues jobs by priority weight.
Real-time streaming is out of scope because stream processing requires continuous ingestion and windowed computation, whereas this design targets finite batch jobs with a defined completion. To extend it, replace the task queue with a Kafka consumer group and add windowed aggregation on the worker side.
Non-Functional Requirements
Core Requirements
- Scale: Support jobs with up to 1 million tasks. Up to 10,000 concurrent workers.
- Fault tolerance: Complete a job even if up to 10% of workers fail during execution.
- Coordinator availability: 99.99% uptime. A coordinator failure must not lose in-progress jobs.
- Assignment throughput: Coordinator assigns tasks at minimum 50,000 tasks per second to avoid being the bottleneck.
- Task execution time: Individual task execution time ranges from 100ms to 30 minutes.
- Read/write ratio on the task store: Writes dominate (status updates from workers heartbeating and completing). Batch reads occur when the coordinator scans for expired leases.
Below the Line
- Sub-second coordinator failover (15 seconds is acceptable for this design)
- Cross-region replication of in-progress job state
- Per-task compute isolation (all workers share the same cluster)
Core Entities
- Job: A unit of work submitted by a client. Contains job_id, input_path (pointer to input data in S3), function_id, status (queued / running / completed / failed), and submitted_at.
- Task: A subdivision of a job. Contains task_id, job_id, partition_spec (key range or file chunk), status (pending / in_progress / completed / failed), assigned_worker_id, and lease_expiry_ts.
- Worker: A machine that executes tasks. Registers its worker_id and capacity, heartbeats while alive, and reports task outcomes to the coordinator.
- Result: The output produced by a worker for a task. Contains task_id, output_path (S3 location of partial result), and whether it is a partial or final result.
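The entities above can be sketched as plain data classes. Field names follow the lists above; the concrete types and defaults are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    job_id: str
    input_path: str           # pointer to input data in S3
    function_id: str
    status: str = "queued"    # queued / running / completed / failed
    submitted_at: float = 0.0

@dataclass
class Task:
    task_id: str
    job_id: str
    partition_spec: dict      # key range or file chunk
    status: str = "pending"   # pending / in_progress / completed / failed
    assigned_worker_id: Optional[str] = None
    lease_expiry_ts: Optional[float] = None

@dataclass
class Worker:
    worker_id: str
    capacity: int
    last_heartbeat_ts: float = 0.0

@dataclass
class Result:
    task_id: str
    output_path: str          # S3 location of partial result
    is_partial: bool = True
```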
API Design
Two surfaces exist: the client API (job submission and status) and the worker API (task pull, heartbeat, completion reporting).
// Client: submit a job for distributed execution
POST /jobs
Content-Type: application/json
{
"input_path": "s3://jobs-input/dataset-2024-01.csv",
"function_id": "word_count_v2",
"output_path": "s3://jobs-output/run-abc123/"
}
Response: 201 Created | { "job_id": "job-abc123", "status": "queued" }
// Client: poll job status and retrieve output location
GET /jobs/{job_id}
Response: 200 OK
{
"job_id": "job-abc123",
"status": "completed",
"tasks_total": 1000,
"tasks_completed": 1000,
"output_path": "s3://jobs-output/run-abc123/final"
}
// Worker: pull the next available task
POST /tasks/claim
Authorization: Bearer <worker-token>
Response: 200 OK
{
"task_id": "task-77",
"partition_spec": { "start_byte": 0, "end_byte": 104857600 },
"input_path": "s3://jobs-input/dataset-2024-01.csv",
"lease_ttl_seconds": 30
}
// Worker: extend task lease with a heartbeat
POST /tasks/{task_id}/heartbeat
Authorization: Bearer <worker-token>
Response: 200 OK | 410 Gone (task was reclaimed by coordinator)
// Worker: report task completion
POST /tasks/{task_id}/complete
Authorization: Bearer <worker-token>
{ "output_path": "s3://jobs-output/run-abc123/task-77.part" }
The 410 on heartbeat is the signal to a worker that its lease was reclaimed. The worker must stop executing and discard its partial output. The lease_ttl_seconds in the claim response tells the worker how often to heartbeat: at intervals of at most TTL / 3, so the coordinator tolerates two missed beats before the lease expires.
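The lease bookkeeping implied by these endpoints reduces to two small functions. A minimal sketch (function names are illustrative, not part of the API):

```python
import time
from typing import Optional

def heartbeat_interval(lease_ttl_seconds: float) -> float:
    """Heartbeat at intervals of at most TTL / 3, so up to two beats
    can be lost before the lease lapses."""
    return lease_ttl_seconds / 3

def lease_expired(lease_expiry_ts: float, now: Optional[float] = None) -> bool:
    """Coordinator-side check: a task whose expiry is in the past is
    eligible for reclamation (the worker then sees 410 Gone)."""
    if now is None:
        now = time.time()
    return lease_expiry_ts < now
```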
High-Level Design
1. Naive approach: push-based coordinator
The simplest design is a single coordinator that tracks available workers and pushes tasks to them one at a time. The coordinator holds the task queue in memory and drives completion by monitoring worker availability.
Request walkthrough (submit and execute):
- Client sends POST /jobs with an input path and function ID.
- Coordinator reads the input metadata, splits it into N tasks, and stores them as pending in the Task Store.
- Coordinator finds idle workers and pushes one task to each.
- Worker executes, completes, and notifies the coordinator via RPC.
- Coordinator marks the task completed and pushes the next pending task to the now-idle worker.
This design is correct for small clusters. The coordinator becomes the bottleneck once the worker count grows. I've seen teams build this exact push loop first without realizing the coordinator serializes every assignment. It works beautifully at 50 workers and falls apart at 5,000.
The coordinator bottleneck problem:
At 10,000 workers completing sub-second tasks, the coordinator must process 10,000 assignments per second through a single loop on a single thread. Network call latency stacks up and the assignment loop serializes what should be parallel decisions. If the coordinator is slow for 10ms, all workers stall simultaneously.
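The bottleneck is easy to quantify. Assuming each push assignment costs roughly 1 ms of serialized work (RPC plus bookkeeping; the figure is illustrative, not measured):

```python
# Back-of-envelope: a single-threaded push loop that spends 1 ms of
# serialized work per assignment caps out at 1,000 assignments/s,
# far below the 50,000/s requirement.
per_assignment_ms = 1.0   # assumed cost of one push (RPC + bookkeeping)
max_assignments_per_sec = 1000 / per_assignment_ms
print(max_assignments_per_sec)
```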
2. Evolved approach: pull-based workers with a distributed task queue
The fix inverts control. Workers pull tasks from a shared queue rather than waiting for the coordinator to push. The coordinator no longer makes per-task assignment decisions: each worker claims its own next task independently, which distributes the assignment decision across all workers.
New components:
- Partition Service: decomposes a submitted job into N task specs and publishes them to the Task Queue.
- Task Queue: a Kafka topic holding pending tasks. Workers pull from it using consumer group semantics.
- Progress Store (Redis): tracks per-task status and lease expiry so the coordinator can report job progress without scanning the full task table.
- Lease Monitor: a background process inside the coordinator that scans for expired leases and re-enqueues stale tasks.
Request walkthrough (evolved):
- Client sends
POST /jobs. - Partition Service splits the input into 1,000 tasks and publishes them to the Task Queue.
- Workers (all 10,000) are polling the queue. Each pulls one task and marks it
in_progressin the Progress Store with a 30-second lease. - Worker executes, writes a partial result to S3, and calls
POST /tasks/{id}/complete. - Coordinator polls the Progress Store; when all tasks are
completed, it triggers the Combiner Fleet.
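The pull flow above can be simulated in a few lines, with an in-memory queue standing in for the Kafka topic and a dict standing in for the Redis Progress Store (both are stand-ins, not the real services):

```python
import queue
import time

task_queue = queue.Queue()   # stands in for the Kafka task topic
progress = {}                # stands in for the Redis Progress Store

for i in range(4):           # Partition Service publishes task specs
    task_queue.put(f"task-{i}")

def claim(worker_id: str, lease_ttl: float = 30.0) -> str:
    """Worker-side pull: claim the next task and record a lease."""
    task_id = task_queue.get_nowait()
    progress[task_id] = {
        "status": "in_progress",
        "worker": worker_id,
        "lease_expiry_ts": time.time() + lease_ttl,
    }
    return task_id

def complete(task_id: str, output_path: str) -> None:
    """Worker reports completion; coordinator never made the assignment."""
    progress[task_id].update(status="completed", output=output_path)

t = claim("worker-1")
complete(t, f"s3://jobs-output/{t}.part")
```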
3. Fault tolerance: heartbeat and lease TTL
A worker that crashes mid-task never reports completion. Its task stays in_progress forever unless the coordinator detects the crash. The lease TTL resolves this: every task has a deadline by which the worker must heartbeat, and the Lease Monitor re-queues any task past its deadline.
When a worker heartbeats, the coordinator extends lease_expiry_ts by TTL seconds in the Progress Store. The Lease Monitor scans for records where lease_expiry_ts < now() and transitions them to pending. Tasks must be idempotent: executing the same task twice must produce the same output without corrupting shared state, since lease expiry guarantees re-execution.
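The Lease Monitor's scan reduces to one pass over the Progress Store. A sketch using a plain dict and a requeue callback (in production this would be a Redis scan feeding a Kafka producer):

```python
import time

def reclaim_expired(progress: dict, requeue) -> list:
    """Transition tasks whose lease has lapsed back to pending and
    hand their IDs to `requeue` for re-enqueueing."""
    now = time.time()
    reclaimed = []
    for task_id, rec in progress.items():
        if rec["status"] == "in_progress" and rec["lease_expiry_ts"] < now:
            rec["status"] = "pending"
            requeue(task_id)
            reclaimed.append(task_id)
    return reclaimed

store = {
    "task-1": {"status": "in_progress", "lease_expiry_ts": time.time() - 5},
    "task-2": {"status": "in_progress", "lease_expiry_ts": time.time() + 60},
}
requeued = []
reclaim_expired(store, requeued.append)   # task-1's lease has lapsed
```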
I'd highlight the idempotency requirement early in an interview because it constrains everything downstream. If your tasks have side effects (writing to a shared counter, sending an email), lease-based re-execution will produce duplicates. The interviewer wants to hear you call this out proactively.
4. Result aggregation: two-phase fan-in
Once all tasks complete, the coordinator has 1,000 partial result files in S3. Routing all of them to a single combiner creates a fan-in bottleneck: the combiner would download 1,000 files sequentially. Two-phase aggregation solves this: workers pre-reduce their output locally before flushing to S3, a combiner fleet merges N/M partials in parallel, and a final merge step produces one output file.
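A toy version of the two phases, using word counts as the per-task output (the grouping factor and the merge function are illustrative choices, not part of the design's API):

```python
from collections import Counter

def merge(partials):
    """Combine a list of per-task Counters into one."""
    out = Counter()
    for p in partials:
        out.update(p)
    return out

def two_phase_aggregate(partials, group_size):
    # Phase 1: the combiner fleet merges fixed-size groups in parallel.
    combined = [merge(partials[i:i + group_size])
                for i in range(0, len(partials), group_size)]
    # Phase 2: a single final merge over the M combiner outputs.
    return merge(combined)

parts = [Counter({"a": 1}), Counter({"a": 2, "b": 1}), Counter({"b": 3})]
total = two_phase_aggregate(parts, group_size=2)
# total == Counter({'b': 4, 'a': 3})
```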
I always draw the single-combiner bottleneck first and let the interviewer feel the problem before introducing the two-phase fix. It shows you understand why the naive approach breaks, not just that you memorized the solution.
Potential Deep Dives
1. How do we partition a large job into balanced tasks?
Well-balanced partitions matter because the job completes only when the last task finishes. A single 10x oversized partition creates a straggler that blocks the entire aggregation phase.
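Sample-based range partitioning (the approach the cheat sheet recommends) can be sketched over sortable keys. The sample below and the quantile-based boundary placement are illustrative:

```python
def partition_boundaries(sample_keys, num_partitions):
    """Place partition boundaries at even quantiles of a sorted sample,
    so each key range receives roughly the same number of records."""
    keys = sorted(sample_keys)
    step = len(keys) / num_partitions
    return [keys[int(i * step)] for i in range(1, num_partitions)]

# A small sample still reflects the key distribution well enough
# to cut the input into roughly equal ranges:
sample = list(range(1000))
bounds = partition_boundaries(sample, 4)
# bounds == [250, 500, 750]
```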
2. How do we assign work to 10,000 workers without a coordinator bottleneck?
The coordinator throughput requirement is 50,000 task assignments per second. A push-based coordinator serializes all assignment decisions through one process. Workers pulling from a distributed queue scale assignment throughput horizontally, since each worker makes its own claim decision independently.
3. How do we handle worker crashes and straggler workers?
A crashed worker and a slow worker are both problems. Crashes are detected by missed heartbeats. Stragglers are detected by watching completion rate as a job nears the end.
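Straggler detection can be expressed as a single predicate. The 95% trigger and 2x-median threshold follow the cheat sheet; the timings below are illustrative:

```python
import statistics

def speculative_candidates(completed_durations, running, now,
                           total_tasks, trigger=0.95, factor=2.0):
    """Once `trigger` of tasks are done, flag running tasks whose
    elapsed time exceeds `factor` x the median completed duration."""
    if len(completed_durations) < trigger * total_tasks:
        return []
    median = statistics.median(completed_durations)
    return [task_id for task_id, started_at in running.items()
            if now - started_at > factor * median]

durations = [10.0] * 96            # 96 of 100 tasks finished in 10 s
running = {"task-77": 0.0}         # one straggler, started at t=0
slow = speculative_candidates(durations, running, now=25.0,
                              total_tasks=100)
# slow == ["task-77"]: 25 s elapsed > 2 x 10 s median
```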
4. How do we make the coordinator highly available?
The coordinator orchestrates every job transition: partition, queue, monitor, aggregate. A coordinator crash mid-job must not lose task state or stall the worker pool indefinitely.
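The standby's recovery path can be sketched as folding an append-only log of state transitions into a task table; the log format here is an assumption, not the actual etcd schema:

```python
def replay(log):
    """Rebuild the coordinator's task table from an append-only log of
    (task_id, new_status) transitions, keeping the last write per task."""
    tasks = {}
    for task_id, status in log:
        tasks[task_id] = status
    return tasks

# The standby continuously applies the leader's log, so on failover
# it already holds state within milliseconds of the last write:
log = [("t1", "pending"), ("t1", "in_progress"),
       ("t2", "pending"), ("t1", "completed")]
state = replay(log)
# state == {"t1": "completed", "t2": "pending"}
```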
Final Architecture
The API Gateway routes client requests to the Job API, which creates the job record and notifies the Coordinator Service. The Coordinator holds the leader election lease in etcd, triggers the Partition Service, and drives all job state transitions including launching the Combiner Fleet when all tasks complete. Workers pull tasks from Kafka, heartbeat through the Progress Store (Redis), and write partial results to S3; the Lease Monitor re-queues timed-out tasks.
When all tasks complete, the Combiner Fleet merges N/M partial files in parallel before a final pass produces one output file. Coordinator HA comes from etcd leader election with a hot standby continuously replaying the task log, enabling failover in roughly 10 seconds after a leader crash.
Interview Cheat Sheet
- Partition the input into roughly equal tasks using sample-based range partitioning: sample 1% of the input to estimate data distribution, then place boundaries at even record intervals aligned on record separators.
- Use a pull model over a push model: workers claim tasks from a shared Kafka queue rather than waiting for the coordinator to assign them, which decouples worker throughput from coordinator throughput entirely.
- Work-stealing through Kafka partitions achieves near-linear horizontal scaling: workers pull from their assigned partition and steal batches from idle sibling partitions, rebalancing load without coordinator involvement.
- Every task must carry a lease with a TTL; the Lease Monitor re-enqueues any task that misses its heartbeat deadline, recovering from worker crashes within TTL + scan_interval seconds.
- Tasks must be idempotent: executing the same task twice must produce the same output without corrupting shared state, because lease expiry always causes re-execution and speculative clones may run in parallel.
- Speculative execution eliminates straggler blocking: once 95% of tasks complete, re-schedule any task running longer than 2x median on an idle worker and take the first result; discard the duplicate.
- Two-phase aggregation removes the fan-in bottleneck: workers write pre-reduced partial results to S3, a combiner fleet merges N/M partials in parallel, and a final merge step combines M combiner outputs into one result.
- Make the coordinator highly available via etcd leader election: the standby continuously replays the task log (WAL) and can take over coordination in roughly 10 seconds after a leader failure (5s TTL expiry plus election plus WAL replay).
- Write all coordinator state transitions to a durable append-only log in etcd so the standby has a complete replay of events up to within milliseconds of the leader's last write.
- Return 410 Gone on a heartbeat when a task lease has been reclaimed; this signals the worker to stop executing immediately and discard its in-progress partial output.
- Tune the lease TTL to 2-3x the worker heartbeat interval (not the task duration; heartbeats keep long tasks alive): too short causes unnecessary re-execution on slow networks; too long delays crash detection and holds up job completion.
- For the design to scale to 1 million tasks, the coordinator must never hold per-task state in memory; store all task state in the Progress Store (Redis) and let the coordinator only orchestrate transitions.