Data Migration
Design a system that migrates petabytes of data from on-premises infrastructure to the cloud with zero data loss, minimal downtime, integrity verification, and the ability to resume after failures.
What is a large-scale data migration system?
A data migration system copies data from one infrastructure (typically on-premises databases) to another (typically cloud object storage or a managed database), verifying that nothing was lost or corrupted in transit. The key engineering challenge is not the copying itself. It is verifying integrity at petabyte scale, limiting the read load placed on production source systems, and recovering from machine failures that happen days into a multi-week job without restarting from scratch.
I like this question because every senior engineer has lived through at least one painful migration. It tests distributed coordination, idempotency, checkpointing, and throughput optimization all at once, and it exposes whether a candidate has real operational experience or only textbook knowledge.
Functional Requirements
Core Requirements
- Copy petabytes of data from source systems to the target cloud storage.
- Verify that every byte was transferred without corruption.
- Tolerate and resume from machine failures without re-transferring data already migrated.
- Minimize impact on the production systems being read from.
Below the Line (out of scope)
- Schema transformation / ETL. We assume source and target schemas are compatible. Real migrations often require field remapping, type coercion, or format changes. To add ETL, you'd insert a transformation worker between the reader and the writer, applying a schema map that's versioned alongside the migration job. Keeping that transformation logic correct across billions of rows is an article in itself.
- Change data capture (keeping source and target in sync after cutover). Once the bulk migration completes, new writes to the source create drift. CDC pipelines like Debezium or DMS handle this, but they belong to a post-migration sync phase, not the bulk transfer system we're designing.
I'd call out both of these scope boundaries explicitly in an interview. Interviewers often probe whether you know about CDC, and naming it as out of scope demonstrates awareness without letting it derail the core design.
The hardest part in scope: Verifying that every byte arrived correctly is the single hardest engineering challenge here. Counting rows is not enough. Computing checksums per chunk and comparing them end-to-end, at petabyte scale, without re-reading the entire source, is where most interview candidates get stuck. We'll dedicate a full deep dive to it.
Non-Functional Requirements
Core Requirements
- Scale: 1 PB of data migrated over 30 days. That averages roughly 1.4 TB/hour (about 1,400 GB/hour) sustained throughput across the fleet.
- Throughput per worker: Each transfer worker sustains 200 MB/s read from source and 150 MB/s write to target (roughly 720 GB/hr peak per worker). At 50 parallel workers, aggregate read throughput reaches ~10 GB/s, providing ~25x headroom above the ~1,400 GB/hr sustained rate required to complete 1 PB in 30 days.
- Data integrity: Zero bytes lost or corrupted. Every chunk is verified with a SHA-256 checksum computed on the source before transfer and recomputed on the target after write.
- Production impact: Total read rate on the source is capped at 10% of peak production query load. Exceeding this would starve live traffic and is a hard constraint, not a nice-to-have.
- Resumability: After any failure, the coordinator detects the lost worker within 30 seconds and reassigns the task within 1 minute. No chunk that was successfully written and verified is re-transferred.
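The throughput numbers above can be sanity-checked in a few lines of arithmetic (decimal units, 1 PB = 10^15 bytes):

```python
# Back-of-envelope check of the scale and headroom figures above.
PB = 1000 ** 5
GB = 1000 ** 3
MB = 1000 ** 2

total_bytes = 1 * PB
window_hours = 30 * 24  # 30-day migration window

# Sustained rate needed to finish on time: ~1,389 GB/hour
sustained_gb_per_hour = total_bytes / window_hours / GB

# Fleet capacity: 50 workers reading at 200 MB/s each
workers = 50
read_rate_bytes_per_s = 200 * MB
fleet_gb_per_hour = workers * read_rate_bytes_per_s * 3600 / GB  # 36,000 GB/hr

headroom = fleet_gb_per_hour / sustained_gb_per_hour  # ~26x

print(round(sustained_gb_per_hour), round(fleet_gb_per_hour), round(headroom))
```

The headroom matters because the fleet rarely runs flat out: source throttling, off-peak windows, and verification passes all eat into it.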
Below the Line
- Sub-second migration latency (this is a batch system, not a streaming one)
- Multi-region active-active replication during migration
- Automatic schema compatibility validation
Read/write ratio: A migration is fundamentally read-heavy on the source and write-heavy on the target. For every 1 byte read from source, exactly 1 byte is written to target, plus roughly 1 additional read on the target for verification. This 1:1:1 read/write/verify pattern means source capacity is the binding constraint for throughput, not target write bandwidth. That shapes the entire architecture: the coordinator's job is to minimize how many times we touch the source, not the target.
Core Entities
- MigrationJob: The top-level unit representing a single migration run. Tracks source location, target location, throughput limit, overall status (pending, running, paused, complete, failed), and job-level metadata like start time and estimated completion.
- MigrationChunk: A contiguous slice of data within the job, described by a byte range or row range. Each chunk carries its SHA-256 source checksum, transfer status (pending, in-progress, verified, failed), and the worker ID that last attempted it.
- WorkerTask: A unit of work dispatched to a specific transfer worker. Maps one chunk to one worker, with a lease expiry time. If the worker does not heartbeat before expiry, the task is reassigned.
- IntegrityReport: The result of comparing source and target checksums for a chunk or the full job. Stores both checksums, comparison result, and timestamp. This is the audit trail that proves the migration completed correctly.
- MigrationCheckpoint: The last durably confirmed progress marker for a job. Written after each chunk is verified. Enables the coordinator to reconstruct in-flight state after a crash without re-scanning the entire job manifest.
Schema details belong in the data model deep dive. These five entities are enough to reason about the full design.
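As a sketch only, the two central entities might look like the following dataclasses. Field names and the enum values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ChunkStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in-progress"
    VERIFIED = "verified"
    FAILED = "failed"

@dataclass
class MigrationJob:
    job_id: str
    source_path: str
    target_path: str
    throughput_limit_gb_per_hour: int
    status: str = "pending"   # pending | running | paused | complete | failed

@dataclass
class MigrationChunk:
    chunk_id: str
    job_id: str
    byte_start: int
    byte_end: int                        # contiguous slice of the source
    source_checksum: str = ""            # SHA-256, filled in before transfer
    status: ChunkStatus = ChunkStatus.PENDING
    last_worker_id: Optional[str] = None # worker that last attempted it
```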
API Design
One endpoint per core functional requirement. The naive shapes work for small migrations, and we'll show where they break at scale.
FR 1: Submit a migration job:
Naive shape:
POST /migrations
Body: { source_path, target_path }
Response: { job_id, status: "pending" }
This works but gives the coordinator no signal about how aggressively to read from the source. A 50-worker migration targeting a production OLTP database would immediately saturate it. The evolved shape adds explicit throttle parameters:
POST /migrations
Body: {
source_path,
target_path,
throughput_limit_gb_per_hour, // e.g. 1400
max_parallel_workers, // e.g. 50
priority: "low" | "normal" | "high"
}
Response: { job_id, status: "pending", estimated_completion_at }
The priority field lets the coordinator rank this job below other jobs sharing the same source, so low-priority migrations yield capacity during peak hours. The coordinator uses throughput_limit_gb_per_hour to issue read tokens to workers via a rate limiter.
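One plausible shape for that rate limiter is a token bucket whose tokens are bytes of read budget; this is a minimal in-process sketch, not a distributed implementation:

```python
import time

class TokenBucket:
    """Byte-budget rate limiter: workers acquire read tokens before each
    source read, so the fleet's aggregate rate stays under the job limit."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_acquire(self, nbytes: int) -> bool:
        # Refill based on elapsed time, capped at the burst capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False  # caller backs off and retries later

# A 1,400 GB/hour job limit is roughly 389 MB/s of read budget,
# with one 64 MB chunk of burst allowance.
limit_gb_per_hour = 1400
bucket = TokenBucket(rate_bytes_per_s=limit_gb_per_hour * 1e9 / 3600,
                     burst_bytes=64e6)
```

In the real system the coordinator would hold the bucket and hand out leases over RPC; the refill arithmetic is the same.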
FR 2: Poll job status and progress:
GET /migrations/{job_id}
Response: {
job_id,
status,
total_bytes,
bytes_transferred,
bytes_verified,
chunks_total,
chunks_complete,
estimated_completion_at,
last_failure_reason
}
Status polling is preferable to webhooks here because the migration client is often a one-off ops script, not a long-running service. The response includes both bytes_transferred and bytes_verified separately, because a chunk that transferred but failed checksum verification must be retried.
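A one-off ops script built on this endpoint can be as small as the loop below. The fetch_status callable is an assumption standing in for an HTTP GET against /migrations/{job_id}:

```python
import time

def wait_for_completion(fetch_status, poll_s: float = 60.0) -> dict:
    """Poll until the job reaches a terminal state. fetch_status is any
    callable returning the GET /migrations/{job_id} body as a dict."""
    while True:
        status = fetch_status()
        if status["status"] in ("complete", "failed"):
            return status
        pct = 100.0 * status["bytes_verified"] / max(status["total_bytes"], 1)
        print(f"{status['job_id']}: {pct:.1f}% verified")
        time.sleep(poll_s)
```

Note the progress line reports bytes_verified, not bytes_transferred, since only verified chunks count as done.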
FR 3: Fetch the integrity verification report:
GET /migrations/{job_id}/report
Response: {
job_id,
overall_status: "pass" | "fail" | "in_progress",
chunks_verified,
chunks_failed,
failed_chunks: [{ chunk_id, source_checksum, target_checksum }]
}
This is a separate endpoint from job status because the report is the compliance artifact. Ops teams and auditors query it after cutover to confirm data fidelity. Pagination applies when chunks_failed is large.
High-Level Design
Start with the simplest system that satisfies one requirement, then evolve.
1. Copy data from source to target
The core data path: a client submits a job, a coordinator splits the source into chunks, and workers transfer each chunk to the target. I'd start an interview answer exactly here, with the simplest possible system that moves bytes. Verification, throttling, and fault tolerance all come later.
Components:
- Client: The operator or automation script that calls POST /migrations.
- Migration Coordinator: Receives the job, discovers the source layout (file tree or row ranges), and subdivides it into chunks. Writes the chunk manifest to the jobs database. Dispatches tasks to workers.
- Transfer Workers: Stateless processes. Each worker pulls a task, reads the assigned chunk from the source, and writes it to the target.
- Jobs DB: Durable store for job metadata and chunk state. Postgres works well at this scale.
- Source System: The origin (on-prem NFS, legacy database, etc.). Workers read from it.
- Target Storage: Cloud object storage (S3, GCS) or a managed database. Workers write to it.
Request walkthrough:
- Client calls POST /migrations with source path, target path, and throughput limit.
- Coordinator discovers the source file/table layout and creates one MigrationChunk record per 64 MB slice.
- Coordinator enqueues a WorkerTask for each chunk.
- Each worker dequeues a task, reads the chunk from source, writes it to target.
- Worker marks the chunk as transferred in the jobs DB.
This diagram covers the happy path: data flows from source to target via parallel workers. It defers checksum verification and failure handling entirely.
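The coordinator's chunking step in the walkthrough above can be sketched as follows (field names follow the chunk entity; the dict shape is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB slices, as in the walkthrough

def make_chunk_manifest(job_id: str, total_bytes: int) -> list:
    """Coordinator step: split the source into contiguous 64 MB chunk
    records. The final chunk may be shorter than CHUNK_SIZE."""
    chunks = []
    for i, start in enumerate(range(0, total_bytes, CHUNK_SIZE)):
        chunks.append({
            "chunk_id": f"{job_id}-{i:08d}",
            "byte_start": start,
            "byte_end": min(start + CHUNK_SIZE, total_bytes),  # exclusive
            "status": "pending",
        })
    return chunks
```

For row-oriented sources the same idea applies with primary-key ranges instead of byte offsets.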
2. Verify every byte transferred without corruption
Transferring data is not enough. A bit flip in transit, a partial network write, or a silent storage error can corrupt data without raising an error code. We add an integrity verification layer.
I've seen migrations pass row-count checks but deliver corrupted data to production. One team lost 0.01% of their rows to silent bit flips that only surfaced two weeks after cutover. Byte-level verification is not optional.
New components:
- Integrity Verifier: A separate service (or worker role) that, after a chunk is written, reads it back from the target and computes its SHA-256 checksum. Compares it against the source checksum stored in the chunk manifest.
- Checksum DB: Stores source checksums (computed before transfer) and target checksums (computed after write). The IntegrityReport queries this.
Updated request walkthrough:
- Before transfer, the coordinator (or a pre-flight scanner) computes SHA-256 of each source chunk and stores it in the Checksum DB.
- Worker transfers the chunk to target and marks it transferred.
- Integrity Verifier reads the chunk back from target, recomputes SHA-256.
- Verifier compares against the stored source checksum. Match: chunk marked verified. Mismatch: chunk marked failed, re-enqueued for re-transfer.
This diagram adds the verification loop. A chunk is not considered done until the Integrity Verifier marks it verified. Chunks that fail re-enter the transfer queue, but only that chunk, not the whole job.
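The verifier's streaming checksum comparison might look like the sketch below. The read functions are stand-ins for ranged reads against the source or target store (e.g. an S3 ranged GET):

```python
import hashlib

def sha256_range(read_fn, start: int, end: int,
                 block: int = 4 * 1024 * 1024) -> str:
    """Stream a byte range through SHA-256 in 4 MB blocks, so a 64 MB
    chunk never has to sit fully in memory."""
    h = hashlib.sha256()
    for off in range(start, end, block):
        h.update(read_fn(off, min(off + block, end)))
    return h.hexdigest()

def verify_chunk(chunk: dict, read_target) -> str:
    """Integrity Verifier step: re-read the chunk from the target and
    compare against the source checksum recorded before transfer."""
    size = chunk["byte_end"] - chunk["byte_start"]
    target_sum = sha256_range(read_target, 0, size)
    return "verified" if target_sum == chunk["source_checksum"] else "failed"
```

The crucial property is that verify_chunk reads from the target independently, never from the worker's own write buffer, which is what catches silent storage errors.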
3. Tolerate machine failures and resume mid-migration
Transfer workers crash. Spot instances get preempted. Network partitions isolate a worker mid-transfer. The naive design loses all progress for that worker's in-flight tasks. We need resumability.
I always ask candidates: "What happens if a worker dies on day 14 of a 30-day migration?" The answer reveals whether they've thought about real operational failure modes or just the happy path.
New components:
- Task Queue: A durable queue (SQS, Kafka, or a DB-backed queue) holding all pending WorkerTask items. Workers pull tasks from the queue. Tasks have a visibility timeout: if a worker does not acknowledge the task within the timeout (e.g., 30 minutes), the queue makes the task visible again for another worker to claim.
- Checkpoint Store: Durable storage for the last confirmed verified offset per chunk. On worker restart, the new worker reads the checkpoint and resumes from the last verified byte, not from byte zero.
- Coordinator heartbeat monitor: Watches for workers that have not heartbeated in the last 30 seconds, matching the detection target in the non-functional requirements, and marks their tasks as stale. The task queue's visibility timeout handles reassignment automatically.
Workers use an idempotent write key derived from (job_id, chunk_id) so that if a worker crashes mid-write and a new worker retries the same chunk, the target storage sees a duplicate write and ignores it (object storage PUT is naturally idempotent). The visibility timeout ensures no chunk is stuck in-progress if its worker vanishes.
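The lease mechanics and the idempotent key can be illustrated with an in-memory stand-in for the durable queue (a real system would use SQS or a DB row with a lease column):

```python
LEASE_SECONDS = 30 * 60  # the 30-minute visibility timeout from above

def claim_task(tasks: list, worker_id: str, now: float):
    """Claim the first unleased pending task, or reclaim one whose lease
    has expired because its worker crashed or was partitioned."""
    for task in tasks:
        unleased = task["worker_id"] is None or task["lease_expires"] <= now
        if task["status"] == "pending" and unleased:
            task["worker_id"] = worker_id
            task["lease_expires"] = now + LEASE_SECONDS
            return task
    return None

def target_key(job_id: str, chunk_id: str) -> str:
    """Stable idempotent write key: a retried chunk PUTs to the same
    object path, so a duplicate write simply overwrites."""
    return f"{job_id}/{chunk_id}"
```

Passing now explicitly keeps the lease logic testable; in production it would come from the queue service's clock, never the worker's.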
Potential Deep Dives
1. How do you ensure no data is lost?
2. How do you maximize throughput without overwhelming the source?
I'd flag this as the operational constraint that defines the architecture. Source protection is a hard constraint, not a performance optimization. Everything else (parallelism, chunk size, worker count) is bounded by how much load the source can safely absorb.
3. How do you handle failures mid-transfer?
I often see candidates underestimate how frequently failures occur at this scale. At 50 workers running for 30 days, you should expect multiple failures per week. The failure handling strategy is not an edge case; it is the normal operating mode.
4. How do you handle schema drift during a multi-week migration?
Schema drift is the silent killer of long-running migrations. I've seen teams complete a 3-week bulk transfer only to discover the source schema changed on day 10, making half the migrated data incompatible with the new application code.
Final Architecture
The complete system, incorporating all three evolved requirements plus the deep dive improvements.
I'd draw this diagram last in an interview and walk the interviewer through the data flow: "A chunk enters at the source, flows through a rate-limited worker, lands on the target, gets verified by an independent reader, and only then is marked done." That one sentence summarizes the entire architecture.
Interview Cheat Sheet
- Split the dataset into 64 MB chunks at the coordinator level. Workers operate on individual chunks, so failures only lose one chunk's work, not the whole job. The coordinator assigns one task per chunk and tracks each independently.
- Coordinator is responsible for splitting input, assigning tasks, detecting failures, and tracking checkpoints. Workers are stateless: given a chunk, read source, write target, compute checksum, report done. The coordinator never does data transfer itself.
- Use SHA-256 checksums per 64 MB chunk stored in a separate DB. Compare source checksum (computed on read) with target checksum (computed on write). Mismatch means re-transfer that chunk only.
- A final reconciliation pass re-reads every target chunk and recomputes the checksum to catch silent corruption introduced after write. This is an independent read by the Integrity Verifier, not the worker's own write buffer. Silent storage errors are only catchable this way.
- Compare record counts between source and target as a fast sanity check before the expensive byte-level verification pass. Count mismatch catches gross failures immediately and avoids burning I/O on a verification pass that will certainly fail.
- Use a task queue with visibility timeouts. A dead worker's task reappears in the queue automatically once its timeout lapses, so reassignment works even with no coordinator involvement; the heartbeat monitor only speeds up detection.
- Chunk writes are idempotent: same chunk ID, same target path. A re-transferred chunk overwrites the corrupted or partial write safely. Workers use a stable key of {job_id}/{chunk_id} for every PUT.
- Coordinator checkpoints progress every 100 chunks. After a coordinator restart, it reads the checkpoint and skips already-verified chunks. This is separate from per-chunk byte-offset checkpoints inside workers.
- Limit source read rate to 10% of peak query load. The source system is still serving production traffic. Saturating its disks causes cascading failures. This is a hard constraint, not a nice-to-have.
- Adaptive throttling: the coordinator watches source DB CPU and read latency. When either exceeds threshold, it reduces active worker count. When both are below threshold for 5 minutes, it adds workers. This is AIMD borrowed from TCP congestion control.
- Parallel workers give linear throughput scaling. 50 workers at 200 MB/s each yields ~10 GB/s aggregate read throughput. On a 1 PB dataset the transfer phase completes in roughly 28 hours of pure transfer time. The 30-day window provides time for verification passes, retries, and throttled off-peak operation.
- Compression at the worker level (LZ4 for speed) reduces network bandwidth by 2-4x for text-heavy datasets. Always benchmark: compressed transfer vs raw transfer for your specific data type. Binary or already-compressed data gains nothing.
- At the end of migration, produce an IntegrityReport: total chunks, verified count, failed count, and a list of any mismatches. This is the audit trail that proves zero data loss. Auditors and compliance teams need this artifact after cutover.
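The adaptive-throttling bullet above reduces to a small control function. The thresholds here are illustrative, and the 5-minute healthy dwell before scaling up is left to the caller's control loop:

```python
def aimd_step(active_workers: int, cpu_pct: float, read_latency_ms: float,
              cpu_limit: float = 70.0, latency_limit_ms: float = 50.0,
              min_workers: int = 1, max_workers: int = 50) -> int:
    """One tick of the additive-increase / multiplicative-decrease
    throttle: halve the fleet when the source looks stressed, add one
    worker back when it looks healthy."""
    if cpu_pct > cpu_limit or read_latency_ms > latency_limit_ms:
        return max(min_workers, active_workers // 2)  # back off fast
    return min(max_workers, active_workers + 1)       # probe gently
```

The asymmetry is the point: one bad sample sheds half the load on the source immediately, while recovery is gradual, exactly as in TCP congestion control.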