Data Migration
Design a system that migrates petabytes of data from on-premises infrastructure to the cloud with zero data loss, minimal downtime, integrity verification, and the ability to resume after failures.
What is a large-scale data migration system?
A data migration system copies data from one infrastructure (typically on-premises databases) to another (typically cloud object storage or a managed database), verifying that nothing was lost or corrupted in transit. The key engineering challenge is not the copying itself. It is verifying integrity at petabyte scale, limiting the read load placed on production source systems, and recovering from machine failures that happen days into a multi-week job without restarting from scratch.
I like this question because every senior engineer has lived through at least one painful migration. It tests distributed coordination, idempotency, checkpointing, and throughput optimization all at once, and it exposes whether a candidate has real operational experience or only textbook knowledge.
Functional Requirements
Core Requirements
- Copy petabytes of data from source systems to the target cloud storage.
- Verify that every byte was transferred without corruption.
- Tolerate and resume from machine failures without re-transferring data already migrated.
- Minimize impact on the production systems being read from.
Below the Line (out of scope)
- Schema transformation / ETL. We assume source and target schemas are compatible. Real migrations often require field remapping, type coercion, or format changes. To add ETL, you'd insert a transformation worker between the reader and the writer, applying a schema map that's versioned alongside the migration job. Keeping that transformation logic correct across billions of rows is an article in itself.
- Change data capture (keeping source and target in sync after cutover). Once the bulk migration completes, new writes to the source create drift. CDC pipelines like Debezium or DMS handle this, but they belong to a post-migration sync phase, not the bulk transfer system we're designing.
I'd call out both of these scope boundaries explicitly in an interview. Interviewers often probe whether you know about CDC, and naming it as out of scope demonstrates awareness without letting it derail the core design.
The hardest part in scope: Verifying that every byte arrived correctly is the single hardest engineering challenge here. Counting rows is not enough. Computing checksums per chunk and comparing them end-to-end, at petabyte scale, without re-reading the entire source, is where most interview candidates get stuck. We'll dedicate a full deep dive to it.
Non-Functional Requirements
Core Requirements
- Scale: 1 PB of data migrated over 30 days. That averages roughly 1.4 TB/hour (about 1,400 GB/hour) sustained throughput across the fleet.
- Throughput per worker: Each transfer worker sustains 200 MB/s read from source and 150 MB/s write to target (roughly 720 GB/hr peak per worker). At 50 parallel workers, aggregate read throughput reaches ~10 GB/s, providing ~25x headroom above the ~1,400 GB/hr sustained rate required to complete 1 PB in 30 days.
- Data integrity: Zero bytes lost or corrupted. Every chunk is verified with a SHA-256 checksum computed on the source before transfer and recomputed on the target after write.
- Production impact: Total read rate on the source is capped at 10% of peak production query load. Exceeding this would starve live traffic and is a hard constraint, not a nice-to-have.
- Resumability: After any failure, the coordinator detects the lost worker within 30 seconds and reassigns the task within 1 minute. No chunk that was successfully written and verified is re-transferred.
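The throughput numbers above can be sanity-checked in a few lines of arithmetic (decimal units, 1 PB = 10^15 bytes):

```python
# Back-of-envelope check of the scale and headroom figures above.
PB = 1000 ** 5
GB = 1000 ** 3
MB = 1000 ** 2

total_bytes = 1 * PB
window_hours = 30 * 24  # 30-day migration window

# Sustained rate needed to finish on time: ~1,389 GB/hour
sustained_gb_per_hour = total_bytes / window_hours / GB

# Fleet capacity: 50 workers reading at 200 MB/s each
workers = 50
read_rate_bytes_per_s = 200 * MB
fleet_gb_per_hour = workers * read_rate_bytes_per_s * 3600 / GB  # 36,000 GB/hr

headroom = fleet_gb_per_hour / sustained_gb_per_hour  # ~26x

print(round(sustained_gb_per_hour), round(fleet_gb_per_hour), round(headroom))
```

The headroom matters because the fleet rarely runs flat out: source throttling, off-peak windows, and verification passes all eat into it.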
Below the Line
- Sub-second migration latency (this is a batch system, not a streaming one)
- Multi-region active-active replication during migration
- Automatic schema compatibility validation
Read/write ratio: A migration is fundamentally read-heavy on the source and write-heavy on the target. For every 1 byte read from source, exactly 1 byte is written to target, plus roughly 1 additional read on the target for verification. This 1:1:1 read/write/verify pattern means source capacity is the binding constraint for throughput, not target write bandwidth. That shapes the entire architecture: the coordinator's job is to minimize how many times we touch the source, not the target.
Core Entities
- MigrationJob: The top-level unit representing a single migration run. Tracks source location, target location, throughput limit, overall status (pending, running, paused, complete, failed), and job-level metadata like start time and estimated completion.
- MigrationChunk: A contiguous slice of data within the job, described by a byte range or row range. Each chunk carries its SHA-256 source checksum, transfer status (pending, in-progress, verified, failed), and the worker ID that last attempted it.
- WorkerTask: A unit of work dispatched to a specific transfer worker. Maps one chunk to one worker, with a lease expiry time. If the worker does not heartbeat before expiry, the task is reassigned.
- IntegrityReport: The result of comparing source and target checksums for a chunk or the full job. Stores both checksums, comparison result, and timestamp. This is the audit trail that proves the migration completed correctly.
- MigrationCheckpoint: The last durably confirmed progress marker for a job. Written after each chunk is verified. Enables the coordinator to reconstruct in-flight state after a crash without re-scanning the entire job manifest.
Schema details belong in the data model deep dive. These five entities are enough to reason about the full design.
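As a sketch only, the two central entities might look like the following dataclasses. Field names and the enum values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ChunkStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in-progress"
    VERIFIED = "verified"
    FAILED = "failed"

@dataclass
class MigrationJob:
    job_id: str
    source_path: str
    target_path: str
    throughput_limit_gb_per_hour: int
    status: str = "pending"   # pending | running | paused | complete | failed

@dataclass
class MigrationChunk:
    chunk_id: str
    job_id: str
    byte_start: int
    byte_end: int                        # contiguous slice of the source
    source_checksum: str = ""            # SHA-256, filled in before transfer
    status: ChunkStatus = ChunkStatus.PENDING
    last_worker_id: Optional[str] = None # worker that last attempted it
```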
API Design
One endpoint per core functional requirement. The naive shapes work for small migrations, and we'll show where they break at scale.
FR 1: Submit a migration job:
Naive shape:
POST /migrations
Body: { source_path, target_path }
Response: { job_id, status: "pending" }
This works but gives the coordinator no signal about how aggressively to read from the source. A 50-worker migration targeting a production OLTP database would immediately saturate it. The evolved shape adds explicit throttle parameters:
POST /migrations
Body: {
source_path,
target_path,
throughput_limit_gb_per_hour, // e.g. 1400
max_parallel_workers, // e.g. 50
priority: "low" | "normal" | "high"
}
Response: { job_id, status: "pending", estimated_completion_at }
The priority field lets the coordinator rank this job below other jobs sharing the same source, so low-priority migrations yield capacity during peak hours. The coordinator uses throughput_limit_gb_per_hour to issue read tokens to workers via a rate limiter.
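One plausible shape for that rate limiter is a token bucket whose tokens are bytes of read budget; this is a minimal in-process sketch, not a distributed implementation:

```python
import time

class TokenBucket:
    """Byte-budget rate limiter: workers acquire read tokens before each
    source read, so the fleet's aggregate rate stays under the job limit."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_acquire(self, nbytes: int) -> bool:
        # Refill based on elapsed time, capped at the burst capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False  # caller backs off and retries later

# A 1,400 GB/hour job limit is roughly 389 MB/s of read budget,
# with one 64 MB chunk of burst allowance.
limit_gb_per_hour = 1400
bucket = TokenBucket(rate_bytes_per_s=limit_gb_per_hour * 1e9 / 3600,
                     burst_bytes=64e6)
```

In the real system the coordinator would hold the bucket and hand out leases over RPC; the refill arithmetic is the same.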
FR 2: Poll job status and progress:
GET /migrations/{job_id}
Response: {
job_id,
status,
total_bytes,
bytes_transferred,
bytes_verified,
chunks_total,
chunks_complete,
estimated_completion_at,
last_failure_reason
}
Status polling is preferable to webhooks here because the migration client is often a one-off ops script, not a long-running service. The response includes both bytes_transferred and bytes_verified separately, because a chunk that transferred but failed checksum verification must be retried.
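A one-off ops script built on this endpoint can be as small as the loop below. The fetch_status callable is an assumption standing in for an HTTP GET against /migrations/{job_id}:

```python
import time

def wait_for_completion(fetch_status, poll_s: float = 60.0) -> dict:
    """Poll until the job reaches a terminal state. fetch_status is any
    callable returning the GET /migrations/{job_id} body as a dict."""
    while True:
        status = fetch_status()
        if status["status"] in ("complete", "failed"):
            return status
        pct = 100.0 * status["bytes_verified"] / max(status["total_bytes"], 1)
        print(f"{status['job_id']}: {pct:.1f}% verified")
        time.sleep(poll_s)
```

Note the progress line reports bytes_verified, not bytes_transferred, since only verified chunks count as done.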
FR 3: Fetch the integrity verification report:
GET /migrations/{job_id}/report
Response: {
job_id,
overall_status: "pass" | "fail" | "in_progress",
chunks_verified,
chunks_failed,
failed_chunks: [{ chunk_id, source_checksum, target_checksum }]
}
This is a separate endpoint from job status because the report is the compliance artifact. Ops teams and auditors query it after cutover to confirm data fidelity. Pagination applies when chunks_failed is large.
High-Level Design
Start with the simplest system that satisfies one requirement, then evolve.
1. Copy data from source to target
The core data path: a client submits a job, a coordinator splits the source into chunks, and workers transfer each chunk to the target. I'd start an interview answer exactly here, with the simplest possible system that moves bytes. Verification, throttling, and fault tolerance all come later.
Components:
- Client: The operator or automation script that calls POST /migrations.
- Migration Coordinator: Receives the job, discovers the source layout (file tree or row ranges), and subdivides it into chunks. Writes the chunk manifest to the jobs database. Dispatches tasks to workers.
- Transfer Workers: Stateless processes. Each worker pulls a task, reads the assigned chunk from the source, and writes it to the target.
- Jobs DB: Durable store for job metadata and chunk state. Postgres works well at this scale.
- Source System: The origin (on-prem NFS, legacy database, etc.). Workers read from it.
- Target Storage: Cloud object storage (S3, GCS) or a managed database. Workers write to it.
Request walkthrough:
- Client calls POST /migrations with source path, target path, and throughput limit.
- Coordinator discovers the source file/table layout and creates one MigrationChunk record per 64 MB slice.
- Coordinator enqueues a WorkerTask for each chunk.
- Each worker dequeues a task, reads the chunk from source, writes it to target.
- Worker marks the chunk as transferred in the jobs DB.
This diagram covers the happy path: data flows from source to target via parallel workers. It defers checksum verification and failure handling entirely.
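The coordinator's chunking step in the walkthrough above can be sketched as follows (field names follow the chunk entity; the dict shape is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB slices, as in the walkthrough

def make_chunk_manifest(job_id: str, total_bytes: int) -> list:
    """Coordinator step: split the source into contiguous 64 MB chunk
    records. The final chunk may be shorter than CHUNK_SIZE."""
    chunks = []
    for i, start in enumerate(range(0, total_bytes, CHUNK_SIZE)):
        chunks.append({
            "chunk_id": f"{job_id}-{i:08d}",
            "byte_start": start,
            "byte_end": min(start + CHUNK_SIZE, total_bytes),  # exclusive
            "status": "pending",
        })
    return chunks
```

For row-oriented sources the same idea applies with primary-key ranges instead of byte offsets.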
2. Verify every byte transferred without corruption
Transferring data is not enough. A bit flip in transit, a partial network write, or a silent storage error can corrupt data without raising an error code. We add an integrity verification layer.
I've seen migrations pass row-count checks but deliver corrupted data to production. One team lost 0.01% of their rows to silent bit flips that only surfaced two weeks after cutover. Byte-level verification is not optional.
New components:
- Integrity Verifier: A separate service (or worker role) that, after a chunk is written, reads it back from the target and computes its SHA-256 checksum. Compares it against the source checksum stored in the chunk manifest.
- Checksum DB: Stores source checksums (computed before transfer) and target checksums (computed after write). The IntegrityReport queries this.
Updated request walkthrough:
- Before transfer, the coordinator (or a pre-flight scanner) computes SHA-256 of each source chunk and stores it in the Checksum DB.
- Worker transfers the chunk to target and marks it transferred.
- Integrity Verifier reads the chunk back from target, recomputes SHA-256.
- Verifier compares against the stored source checksum. Match: chunk marked verified. Mismatch: chunk marked failed, re-enqueued for re-transfer.
This diagram adds the verification loop. A chunk is not considered done until the Integrity Verifier marks it verified. Chunks that fail re-enter the transfer queue, but only that chunk, not the whole job.
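The verifier's streaming checksum comparison might look like the sketch below. The read functions are stand-ins for ranged reads against the source or target store (e.g. an S3 ranged GET):

```python
import hashlib

def sha256_range(read_fn, start: int, end: int,
                 block: int = 4 * 1024 * 1024) -> str:
    """Stream a byte range through SHA-256 in 4 MB blocks, so a 64 MB
    chunk never has to sit fully in memory."""
    h = hashlib.sha256()
    for off in range(start, end, block):
        h.update(read_fn(off, min(off + block, end)))
    return h.hexdigest()

def verify_chunk(chunk: dict, read_target) -> str:
    """Integrity Verifier step: re-read the chunk from the target and
    compare against the source checksum recorded before transfer."""
    size = chunk["byte_end"] - chunk["byte_start"]
    target_sum = sha256_range(read_target, 0, size)
    return "verified" if target_sum == chunk["source_checksum"] else "failed"
```

The crucial property is that verify_chunk reads from the target independently, never from the worker's own write buffer, which is what catches silent storage errors.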
3. Tolerate machine failures and resume mid-migration
Transfer workers crash. Spot instances get preempted. Network partitions isolate a worker mid-transfer. The naive design loses all progress for that worker's in-flight tasks. We need resumability.
I always ask candidates: "What happens if a worker dies on day 14 of a 30-day migration?" The answer reveals whether they've thought about real operational failure modes or just the happy path.
New components:
- Task Queue: A durable queue (SQS, Kafka, or a DB-backed queue) holding all pending WorkerTask items. Workers pull tasks from the queue. Tasks have a visibility timeout: if a worker does not acknowledge the task within the timeout (e.g., 30 minutes), the queue makes the task visible again for another worker to claim.
- Checkpoint Store: Durable storage for the last confirmed verified offset per chunk. On worker restart, the new worker reads the checkpoint and resumes from the last verified byte, not from byte zero.
- Coordinator heartbeat monitor: Watches for workers that have not heartbeated in the last 30 seconds, matching the detection target in the non-functional requirements, and marks their tasks as stale. The task queue's visibility timeout handles reassignment automatically.
Workers use an idempotent write key derived from (job_id, chunk_id) so that if a worker crashes mid-write and a new worker retries the same chunk, the target storage sees a duplicate write and ignores it (object storage PUT is naturally idempotent). The visibility timeout ensures no chunk is stuck in-progress if its worker vanishes.
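The lease mechanics and the idempotent key can be illustrated with an in-memory stand-in for the durable queue (a real system would use SQS or a DB row with a lease column):

```python
LEASE_SECONDS = 30 * 60  # the 30-minute visibility timeout from above

def claim_task(tasks: list, worker_id: str, now: float):
    """Claim the first unleased pending task, or reclaim one whose lease
    has expired because its worker crashed or was partitioned."""
    for task in tasks:
        unleased = task["worker_id"] is None or task["lease_expires"] <= now
        if task["status"] == "pending" and unleased:
            task["worker_id"] = worker_id
            task["lease_expires"] = now + LEASE_SECONDS
            return task
    return None

def target_key(job_id: str, chunk_id: str) -> str:
    """Stable idempotent write key: a retried chunk PUTs to the same
    object path, so a duplicate write simply overwrites."""
    return f"{job_id}/{chunk_id}"
```

Passing now explicitly keeps the lease logic testable; in production it would come from the queue service's clock, never the worker's.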
Potential Deep Dives
1. How do you ensure no data is lost?
2. How do you maximize throughput without overwhelming the source?
I'd flag this as the operational constraint that defines the architecture. Source protection is a hard constraint, not a performance optimization. Everything else (parallelism, chunk size, worker count) is bounded by how much load the source can safely absorb.
3. How do you handle failures mid-transfer?
I often see candidates underestimate how frequently failures occur at this scale. At 50 workers running for 30 days, you should expect multiple failures per week. The failure handling strategy is not an edge case; it is the normal operating mode.
4. How do you handle schema drift during a multi-week migration?
Schema drift is the silent killer of long-running migrations. I've seen teams complete a 3-week bulk transfer only to discover the source schema changed on day 10, making half the migrated data incompatible with the new application code.
Final Architecture
The complete system, incorporating all three evolved requirements plus the deep dive improvements.
I'd draw this diagram last in an interview and walk the interviewer through the data flow: "A chunk enters at the source, flows through a rate-limited worker, lands on the target, gets verified by an independent reader, and only then is marked done." That one sentence summarizes the entire architecture.
Interview Cheat Sheet
- Split the dataset into 64 MB chunks at the coordinator level. Workers operate on individual chunks, so failures only lose one chunk's work, not the whole job. The coordinator assigns one task per chunk and tracks each independently.
- Coordinator is responsible for splitting input, assigning tasks, detecting failures, and tracking checkpoints. Workers are stateless: given a chunk, read source, write target, compute checksum, report done. The coordinator never does data transfer itself.
- Use SHA-256 checksums per 64 MB chunk stored in a separate DB. Compare source checksum (computed on read) with target checksum (computed on write). Mismatch means re-transfer that chunk only.
- A final reconciliation pass re-reads every target chunk and recomputes the checksum to catch silent corruption introduced after write. This is an independent read by the Integrity Verifier, not the worker's own write buffer. Silent storage errors are only catchable this way.
- Compare record counts between source and target as a fast sanity check before the expensive byte-level verification pass. Count mismatch catches gross failures immediately and avoids burning I/O on a verification pass that will certainly fail.
- Use a task queue with visibility timeouts. A dead worker's task reappears in the queue automatically once its timeout lapses, so reassignment works even with no coordinator involvement; the heartbeat monitor only speeds up detection.
- Chunk writes are idempotent: same chunk ID, same target path. A re-transferred chunk overwrites the corrupted or partial write safely. Workers use a stable key of {job_id}/{chunk_id} for every PUT.
- Coordinator checkpoints progress every 100 chunks. After a coordinator restart, it reads the checkpoint and skips already-verified chunks. This is separate from per-chunk byte-offset checkpoints inside workers.
- Limit source read rate to 10% of peak query load. The source system is still serving production traffic. Saturating its disks causes cascading failures. This is a hard constraint, not a nice-to-have.
- Adaptive throttling: the coordinator watches source DB CPU and read latency. When either exceeds threshold, it reduces active worker count. When both are below threshold for 5 minutes, it adds workers. This is AIMD borrowed from TCP congestion control.
- Parallel workers give linear throughput scaling. 50 workers at 200 MB/s each yields ~10 GB/s aggregate read throughput. On a 1 PB dataset the transfer phase completes in roughly 28 hours of pure transfer time. The 30-day window provides time for verification passes, retries, and throttled off-peak operation.
- Compression at the worker level (LZ4 for speed) reduces network bandwidth by 2-4x for text-heavy datasets. Always benchmark: compressed transfer vs raw transfer for your specific data type. Binary or already-compressed data gains nothing.
- At the end of migration, produce an IntegrityReport: total chunks, verified count, failed count, and a list of any mismatches. This is the audit trail that proves zero data loss. Auditors and compliance teams need this artifact after cutover.
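The adaptive-throttling bullet above reduces to a small control function. The thresholds here are illustrative, and the 5-minute healthy dwell before scaling up is left to the caller's control loop:

```python
def aimd_step(active_workers: int, cpu_pct: float, read_latency_ms: float,
              cpu_limit: float = 70.0, latency_limit_ms: float = 50.0,
              min_workers: int = 1, max_workers: int = 50) -> int:
    """One tick of the additive-increase / multiplicative-decrease
    throttle: halve the fleet when the source looks stressed, add one
    worker back when it looks healthy."""
    if cpu_pct > cpu_limit or read_latency_ms > latency_limit_ms:
        return max(min_workers, active_workers // 2)  # back off fast
    return min(max_workers, active_workers + 1)       # probe gently
```

The asymmetry is the point: one bad sample sheds half the load on the source immediately, while recovery is gradual, exactly as in TCP congestion control.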