Database Control Plane
Walk through the full control plane design of a distributed database like DynamoDB or CockroachDB: from table provisioning to live shard splitting to failure-driven partition recovery at 100 nodes and 10,000 shards.
What is a database control plane?
A distributed database is two systems stacked on top of each other: the data plane handles reads and writes, while the control plane decides where data lives, who can access it, and what to do when nodes fail or capacity changes. The engineering challenge is that you are designing a distributed system whose job is to manage another distributed system. Every metadata decision must be strongly consistent, because a stale routing entry silently directs a write to the wrong shard.
This question tests control-plane architecture, distributed metadata management, failure detection protocols, live data rebalancing, and the CAP tradeoff in a context where AP-style inconsistency in metadata is not acceptable.
Functional Requirements
Core Requirements
- Create, configure, and delete tables: schema definition, partition key selection, and capacity provisioning.
- Monitor all data plane nodes and automatically recover from node failures via partition reassignment.
- Scale capacity up or down by adding nodes, splitting shards, and rebalancing partitions without downtime.
- Authenticate database clients and enforce per-table access control policies.
Below the Line (out of scope)
- Data plane internals (storage engine, WAL management, compaction, query execution)
- Point-in-time restore and application-level backup restore workflows
- Cross-cluster federation and global table multi-region replication
- Query planner and cost-based optimization
The hardest part in scope: Live shard splitting without downtime. The control plane must coordinate a multi-step partition migration while the data plane continues serving read and write traffic on the exact shards being split. We will dedicate a full deep dive to this problem.
Data plane internals are below the line because the control plane and data plane communicate over a narrow API: the control plane tells data nodes which partitions they own and data nodes report health metrics back. What the data node does with its partitions internally is a separate bounded context.
Point-in-time restore is below the line because it does not change the control plane topology logic. To add it, I would periodically snapshot partition data to object storage and add a RestoreTable API that provisions a new table with data rolled back from a chosen snapshot plus WAL replay.
Cross-cluster federation is too large for a single interview. To add it, I would introduce a global routing tier with a cross-region control plane that uses CRDT-based schema convergence and routes writes to the owning region.
Query planner optimization is below the line because it lives inside the data plane's execution layer, not in the control plane's routing or topology logic. To add it, I would extend the control plane's schema registry to include column statistics (cardinality, null rate, value histograms) and expose them to a dedicated query planner service that constructs execution plans before handing off to the data plane.
Non-Functional Requirements
Core Requirements
- Consistency: Metadata (routing table, schema, topology) must be linearizable; a stale routing read is not acceptable even for one second. A stale routing entry silently directs a write to the wrong shard with no error surfaced to the application.
- Availability: 99.99% uptime for the control plane service. A control plane outage means no new tables can be created and node failures cannot be automatically recovered.
- Provisioning latency: Table creation completes within 10 seconds for tables with up to 10 partitions. Clients receive a CREATING status immediately and poll for ACTIVE.
- Failure detection: Node failures are detected within 30 seconds. Partition reassignment (recovery) completes within 120 seconds of detection.
- Scale: Support 1,000 tables per cluster, 100 data plane nodes per cluster, and 10,000 partitions per cluster.
Below the Line
- Sub-second metadata propagation across all nodes (5-second propagation delay is acceptable)
- Multi-region control plane replication (single-region in this design)
Read/write ratio: The control plane sees two entirely different traffic patterns. Provisioning operations (CreateTable, ModifyCapacity, UpdatePolicy) are rare, maybe 1,000 per day cluster-wide. But routing lookups happen on every single data plane request. At 100 nodes processing 10,000 requests per second each, that is 1,000,000 routing lookups per second against the metadata store. The design tension is sharp: writes must be strongly consistent to prevent misroutes; reads must be served from a fast local cache to avoid making the metadata store a bottleneck. I'd call this the central architectural constraint of the whole system.
Metadata must be strongly consistent because a stale routing entry is a write-to-wrong-shard error that is invisible to the application. The 99.99% availability target means the control plane itself must run with Raft-based redundancy, not a single process.
Core Entities
- Table: A named logical table with a schema, partition key definition, and current status (CREATING, ACTIVE, DELETING, SCALING). The table record is the anchor for all provisioning operations.
- Partition: A shard of a table covering a contiguous key range. Carries the key range boundaries, the primary node assignment, replica node assignments, a version counter, and current status (HEALTHY, SPLITTING, MIGRATING).
- Node: A physical data plane server. Carries its endpoint address, health status (HEALTHY, SUSPECTED, FAILED), last heartbeat timestamp, and the list of partition IDs it owns.
- AccessPolicy: A binding of a principal (IAM role or user ARN) to a set of allowed operations (read, write, admin) on a specific table with an allow or deny effect.
- SchemaVersion: An immutable snapshot of a table's column definitions at a given version number. Used for schema evolution and compatibility validation when modifying a live table.
- Credential: A short-lived signed token issued by the control plane that embeds the principal's allowed actions on a specific table with an expiry timestamp. Verified by data nodes without a round-trip to the control plane.
Key fields are shown below for the most important entities; full index and storage optimization details are deferred to a schema deep dive if pursued in the interview.
| Entity | Key Fields |
|---|---|
| Table | table_id (PK), table_name, partition_key, sort_key?, status (CREATING/ACTIVE/DELETING/SCALING), capacity_mode, read_units, write_units, schema_version_id, created_at |
| Partition | partition_id (PK), table_id (FK), key_range_low, key_range_high, primary_node_id, replica_node_ids[], version, status (HEALTHY/SPLITTING/MIGRATING) |
| Node | node_id (PK), endpoint, status (HEALTHY/SUSPECTED/FAILED), last_heartbeat_at, partition_ids[] |
API Design
FR 1 - Create and manage a table:
POST /tables
Body: {
table_name, partition_key, sort_key?,
capacity: { mode: "PROVISIONED"|"ON_DEMAND", read_units?, write_units? },
schema: { attributes: [{ name, type }] }
}
Response: { table_arn, status: "CREATING" }
We return CREATING immediately rather than blocking until ACTIVE. Partition assignment and data node initialization take several seconds. Clients poll GET /tables/{name} for status. This async pattern mirrors how AWS CloudFormation and DynamoDB handle provisioning.
GET /tables/{table_name}
Response: { table_arn, status, partition_count, capacity, created_at }
DELETE /tables/{table_name}
Response: HTTP 202 Accepted
DELETE returns 202 rather than 204 because partition deallocation and data cleanup are async operations. A 204 would imply the deletion is complete.
FR 2 - Inspect cluster health (operator-facing):
GET /clusters/{cluster_id}/nodes
Response: {
nodes: [{ node_id, endpoint, status, partitions_owned,
metrics: { cpu_pct, disk_pct, ops_per_sec, replication_lag_ms } }]
}
FR 3 - Modify table capacity:
PUT /tables/{table_name}/capacity
Body: { mode: "PROVISIONED", read_units: 5000, write_units: 1000 }
Response: { status: "SCALING", estimated_completion_seconds: 45 }
Capacity changes are also async. A SCALING status means the Partition Assigner may be splitting or merging shards in the background. The client polls the table status until it returns to ACTIVE.
FR 4 - Manage access policies and issue credentials:
PUT /tables/{table_name}/access-policies/{principal_arn}
Body: { actions: ["read", "write"], effect: "allow" }
Response: { policy_version }
POST /tables/{table_name}/credentials
Body: { principal_arn, ttl_seconds: 3600 }
Response: { token, expires_at }
POST /credentials issues short-lived signed tokens, not long-lived API keys. A leaked long-lived key gives an attacker indefinite access. A 1-hour token limits blast radius to a narrow window and rotates automatically without any client credential management.
High-Level Design
1. Creating and configuring a table
The write path for table provisioning is a multi-step workflow, not a single synchronous call.
The naive approach is a single HTTP endpoint that registers a schema and returns 200. The problem is that table creation actually involves partition math (how many shards for this capacity?), node selection (which healthy nodes should own them?), and data node initialization (allocate storage, notify the node). None of that fits in a synchronous HTTP response. The correct pattern is: write intent to Metadata Store, return CREATING, let a background coordinator drive the workflow to completion.
Components:
- CP API Service: Stateless HTTP service. Validates the CreateTable request, writes the table record with status `CREATING` to the Metadata Store, and returns immediately.
- Metadata Store: A Raft-based strongly consistent KV store (like etcd). The single source of truth for the routing table, schema registry, and topology. Every control plane decision writes here first.
- Partition Assigner: Reads pending CreateTable jobs. Computes the initial partition count from the requested capacity. Assigns each partition to a healthy data node. Writes the partition map to the Metadata Store.
- Data Plane Nodes: Receive partition ownership notifications. Allocate storage for the new partition. Reply to the CP API with an ACK.
Request walkthrough:
- Client sends `POST /tables` with schema and capacity parameters.
- CP API validates the request, writes `Table(status=CREATING)` and a CreateTable job to the Metadata Store.
- CP API returns `{ status: "CREATING" }` to the client immediately.
- Partition Assigner reads the pending job. For a small table (100 WCU), it assigns 1 partition. For a large table (100,000 WCU), it may assign 100+ partitions.
- Partition Assigner writes the partition map (partition_id, key_range, node_id) to the Metadata Store.
- CP API notifies each assigned data node of its new partition assignments via a push notification.
- Data nodes allocate storage and ACK. Once all ACKs are received, the CP API updates the table status to `ACTIVE`.
- Client polling `GET /tables/{name}` eventually sees `ACTIVE`.
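The Partition Assigner's capacity math and placement step can be sketched as follows. The 1,000-WCU-per-partition ceiling, the hash keyspace split, and round-robin placement are illustrative assumptions, not fixed by this design:

```python
import math

PARTITION_MAX_WCU = 1000  # assumed per-partition write ceiling

def plan_partitions(write_units: int, healthy_nodes: list[str]) -> list[dict]:
    """Compute the initial partition count from requested capacity and
    assign each partition a key range and a healthy owner (round-robin)."""
    count = max(1, math.ceil(write_units / PARTITION_MAX_WCU))
    step = 2**64 // count  # split the hash keyspace into even ranges
    partitions = []
    for i in range(count):
        partitions.append({
            "partition_id": f"p{i}",
            "key_range_low": i * step,
            "key_range_high": (i + 1) * step - 1 if i < count - 1 else 2**64 - 1,
            "primary_node_id": healthy_nodes[i % len(healthy_nodes)],
        })
    return partitions
```

A 100-WCU table yields a single partition; a 100,000-WCU table yields 100, matching the walkthrough above.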
This covers the write path only. The data plane request path (how clients route reads and writes to the correct node) is handled by the routing table in the Metadata Store, which data plane clients cache locally.
2. Detecting node failures and triggering recovery
The control plane must detect failures independently of the data plane, because a failed node cannot report itself as failed.
Each data plane node sends a heartbeat to the Health Monitor every 10 seconds. If no heartbeat arrives for 30 seconds, the node is marked SUSPECTED. If still absent at 60 seconds, it is marked FAILED. The two-stage window prevents false positives from transient network hiccups. A node that was SUSPECTED and recovers before 60 seconds is immediately returned to HEALTHY.
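The two-stage state machine is small enough to sketch directly; the thresholds are the ones stated above:

```python
SUSPECT_AFTER = 30  # seconds of heartbeat silence before SUSPECTED
FAIL_AFTER = 60     # seconds of silence before FAILED

def classify(last_heartbeat_at: float, now: float) -> str:
    """Two-stage failure state machine driven purely by heartbeat
    staleness. A node that heartbeats again before FAIL_AFTER simply
    classifies back to HEALTHY on the next evaluation."""
    silence = now - last_heartbeat_at
    if silence >= FAIL_AFTER:
        return "FAILED"
    if silence >= SUSPECT_AFTER:
        return "SUSPECTED"
    return "HEALTHY"
```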
When a node hits FAILED:
- Health Monitor publishes a `NodeFailed` event.
- Recovery Coordinator reads that node's partition assignments from the Metadata Store.
- For each partition, the Coordinator selects a healthy replica node to become the new primary (promoting an existing replica is faster than copying data from scratch).
- Recovery Coordinator atomically writes the updated routing table to the Metadata Store.
- Data plane routing caches are invalidated and refreshed on the next TTL expiry (or via a push notification for latency-sensitive deployments).
- A background replication job starts to bring the affected partitions back to full replica count.
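A hedged sketch of the Recovery Coordinator's promotion pass, assuming a partition map shaped like the Partition entity above (field names illustrative):

```python
def recover_node(failed_node: str, partitions: dict, node_health: dict) -> dict:
    """For each partition the failed node owned as primary, promote the
    first healthy replica and return the routing updates, which would be
    written to the Metadata Store in one atomic batch."""
    updates = {}
    for pid, p in partitions.items():
        if p["primary_node_id"] != failed_node:
            continue
        candidates = [r for r in p["replica_node_ids"]
                      if node_health.get(r) == "HEALTHY"]
        if not candidates:
            continue  # no promotable replica: needs a full rebuild, out of scope
        new_primary = candidates[0]
        updates[pid] = {**p,
                        "primary_node_id": new_primary,
                        "replica_node_ids": [r for r in p["replica_node_ids"]
                                             if r != new_primary],
                        "version": p["version"] + 1}
    return updates
```

Bumping `version` on every reassignment lets data nodes reject requests routed with a stale partition version.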
Routing cache invalidation latency determines your inconsistency window. If data nodes cache the routing table for 5 seconds, there is a 5-second window where a client could be directed to the old (failed) node. The safest approach is to return a routing error from the dead node, forcing the client to retry and pick up the updated routing table.
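The retry path can be sketched from the client's side; `fetch_routing` and `send` are stand-ins for the real metadata and data plane RPCs:

```python
def execute_with_retry(key: str, routing_cache: dict, fetch_routing, send) -> str:
    """Send to the cached owner; on a routing error, refresh the local
    routing cache from the Metadata Store and retry once. This bounds
    the stale-cache window without waiting for TTL expiry."""
    resp = send(routing_cache[key], key)
    if resp == "WRONG_NODE":
        routing_cache.update(fetch_routing())
        resp = send(routing_cache[key], key)
    return resp
```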
Components added in this step:
- Health Monitor: Tracks heartbeats from all data plane nodes. Manages the HEALTHY/SUSPECTED/FAILED state machine.
- Recovery Coordinator: Reacts to `NodeFailed` events. Reads affected partitions, selects replacement nodes, writes updated routing to the Metadata Store.
Recovery completes within our 120-second SLA because promoting an existing replica (already has data) takes seconds, not minutes. The replica rebuild runs in the background and does not block client traffic.
3. Scaling capacity without downtime
Scaling has two distinct paths: scale out (add nodes and rebalance), and shard splitting (subdivide a hot partition).
Both operations must happen while the data plane continues serving traffic. Taking a partition offline to rebalance is not an option given the 99.99% availability NFR.
Scale out (add a new node to the cluster):
- Operator adds a node (via API or auto-scaler). The new node registers itself in the Metadata Store.
- The Scaler detects the imbalance (some nodes own more partitions than others).
- For each partition to be migrated, the Scaler starts a background copy of the partition data to the new node.
- While the copy runs, the source node continues serving traffic. New writes to this partition are also replicated to the destination node.
- When the copy catches up to within a small lag, the Scaler atomically swaps the routing table entry: the new node becomes the primary for this partition.
- The old node stops serving the partition and cleans up its local copy.
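The catch-up check behind the atomic swap can be sketched with WAL log sequence numbers (LSNs); the 100-entry lag bound is an illustrative assumption:

```python
def migrate_partition(routing: dict, partition_id: str, new_node: str,
                      source_lsn: int, dest_lsn: int, max_lag: int = 100) -> bool:
    """Swap the primary only once the destination has applied WAL
    entries to within max_lag of the source's head. Until then the
    source stays primary and delta streaming continues."""
    if source_lsn - dest_lsn > max_lag:
        return False  # not caught up yet; keep streaming deltas
    entry = routing[partition_id]
    routing[partition_id] = {**entry,
                             "primary_node_id": new_node,
                             "version": entry["version"] + 1}
    return True
```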
Shard splitting (a single partition is too hot):
- The Health Monitor detects a partition exceeding its ops/sec or storage threshold.
- The Scaler selects the split key (median key in the partition's key space).
- Data is divided: keys below the split point stay on the current node, keys above are copied to a second node.
- The Metadata Store is updated atomically: the single routing entry is replaced with two entries covering the two new key ranges.
- Clients see either the pre-split routing table or the post-split routing table. They never see a partial state.
The atomic routing table update is the key mechanism. Because the Metadata Store is Raft-based, updates are linearizable. A client that reads the routing table before the update always gets the old single-range entry. A client that reads after always gets both new range entries. There is no moment where the routing table is inconsistent.
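To make the atomicity concrete, here is a toy compare-and-swap transaction standing in for the Raft-backed store (etcd's real API is a `Txn` with compare guards; this in-memory sketch only models the semantics):

```python
class MetadataStore:
    """Toy linearizable KV store with an all-or-nothing transaction."""
    def __init__(self):
        self._kv = {}

    def get(self, key):
        return self._kv.get(key)

    def txn(self, expect: dict, delete: list, put: dict) -> bool:
        """All guards must match current values; then every mutation
        applies atomically. No reader ever observes a partial state."""
        if any(self._kv.get(k) != v for k, v in expect.items()):
            return False
        for k in delete:
            self._kv.pop(k, None)
        self._kv.update(put)
        return True

def commit_split(store, old_key, old_entry, low_entry, high_entry):
    """Replace one routing entry with two in a single transaction,
    guarded on the pre-split entry so a racing coordinator loses."""
    return store.txn(
        expect={old_key: old_entry},
        delete=[old_key],
        put={low_entry["partition_id"]: low_entry,
             high_entry["partition_id"]: high_entry},
    )
```

The guard also makes the split idempotent-safe: a second coordinator retrying the same split fails its guard instead of corrupting the routing table.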
Components added in this step:
- Scaler: Monitors partition load metrics. Triggers rebalance operations when imbalance or hot partition conditions are detected.
4. Authenticating clients and enforcing per-table access control
The naive approach puts AuthZ on the hot data plane path; the correct approach moves it off that path entirely.
Naive approach: every data plane read/write request calls the control plane's AuthZ endpoint to validate the client's credentials against the table's access policies. This adds a control-plane network round-trip (~5-20ms) to every single data operation, and creates a hard availability dependency: if the control plane is degraded, no data operations proceed.
The better approach is token-based: the client authenticates once to the control plane (via IAM), receives a short-lived signed JWT that embeds the principal's allowed actions on the specific table. The data node validates the JWT signature locally, using the control plane's public key, without any round-trip. The JWT carries: table_arn, principal_arn, actions, issued_at, expires_at, signed with the control plane's private key.
Components added in this step:
- Auth Service: Part of the CP API Service. Accepts `POST /tables/{name}/credentials` calls. Validates the principal against IAM, checks the table's AccessPolicy, signs a JWT with the control plane's private key, and returns it.
- Token Cache (data nodes): Each data node caches the control plane's public key and verifies token signatures locally. No control plane round-trip on the hot path.
The data node never calls the control plane during a live data request. A compromised JWT is valid for at most 1 hour (the configured TTL). We rotate the signing key pair periodically; when the new public key is published, data nodes refresh their local key cache.
5. Monitoring and alerting
Four metrics drive the entire operational story of the control plane.
The Health Monitor tracks data plane ops/sec per partition. When a partition's write rate exceeds its provisioned WCU threshold (e.g., 110% for 60 seconds), the Scaler is triggered to split or rebalance. This is the primary auto-scaling signal.
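The 110%-for-60-seconds trigger might look like the following; the sample format and defaults are assumptions:

```python
def should_split(samples: list[tuple[float, float]], provisioned_wcu: float,
                 threshold: float = 1.10, window_s: float = 60.0) -> bool:
    """samples are (timestamp_seconds, observed_wcu) pairs. Trigger a
    split only if every sample in the trailing window exceeds the
    threshold -- a single spike is not enough."""
    if not samples:
        return False
    latest = samples[-1][0]
    window = [wcu for t, wcu in samples if t >= latest - window_s]
    return all(wcu > provisioned_wcu * threshold for wcu in window)
```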
Replication lag per partition measures how far behind a replica is from its primary. A lag above 1 second triggers an alert because it widens the data loss window if the primary fails during that window. A lag above 5 seconds triggers a pager alert because it indicates a replica that may be too far behind to promote quickly.
Metadata Store write latency (p99) tracks Raft health. Under normal conditions, a Raft write commits in under 10ms. If p99 latency rises above 100ms, the Raft group is under contention and all provisioning operations will slow.
Heartbeat staleness is the raw input to the failure state machine: no heartbeat for 30 seconds means SUSPECTED, 60 seconds means FAILED. This is the most operationally important alert because a FAILED node without immediate recovery means lost capacity.
Potential Deep Dives
1. How does the control plane detect node failures and trigger recovery?
There are three constraints to design against:
- Failures must be detected within 30 seconds, not minutes.
- False positives (marking a slow but alive node as FAILED) cause unnecessary data movement.
- The failure detection mechanism must itself be highly available: if the detector fails, the whole cluster's recovery stops.
2. How does the control plane assign partitions and manage the routing table?
There are three constraints to design against:
- Partition assignment must spread data and load evenly across all nodes.
- Adding a new node should require moving only 1/N of total data, not a full reshuffle.
- The routing table must support fast point lookups and range scans.
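One standard way to satisfy the second constraint is consistent hashing with virtual nodes, as a hedged sketch (50 vnodes here rather than a production-scale figure, to keep the demo fast):

```python
import bisect, hashlib

def _h(s: str) -> int:
    # Stable 64-bit hash derived from SHA-256
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes: adding a node moves
    roughly 1/N of keys, and the moved load is drawn evenly from all
    existing nodes rather than from a single neighbor."""
    def __init__(self, vnodes: int = 50):
        self.vnodes = vnodes
        self._ring = []  # sorted (hash, node) pairs

    def add_node(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_h(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        i = bisect.bisect(self._ring, (_h(key), ""))
        return self._ring[i % len(self._ring)][1]  # wrap at ring end
```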
3. How does the control plane split a hot partition without downtime?
The shard split problem is one of the hardest coordination problems in the entire system. The partition is serving live traffic. We cannot take it offline. We need to atomically transition from one routing entry to two, while ensuring no write is lost and no read returns stale data.
4. How does the control plane manage access control and credentials?
The AuthZ mechanism must satisfy two competing demands: correctness (only authorized principals can access tables) and performance (no control plane round-trip on the live data path).
Final Architecture
The Metadata Store is the single architectural load-bearing wall. Every component reads from or writes to it, but no component calls another component directly. This means the control plane is decoupled: the Partition Assigner, Scaler, Health Monitor, and Recovery Coordinator all operate on the shared Metadata Store independently, making each component replaceable and independently scalable. The data plane never calls the control plane during live requests: the JWT-based AuthZ and the locally cached routing table keep the hot path free of control plane dependencies.
Interview Cheat Sheet
- State the control plane vs. data plane split immediately: the control plane manages topology, schema, and access; the data plane handles actual reads and writes.
- The control plane's metadata must be strongly consistent (CP over AP). A stale routing table directs a write to the wrong shard silently.
- Use etcd or a 5-node Raft group for the Metadata Store. Single-node or AP metadata stores are not acceptable for routing table updates.
- Table provisioning is async: return `CREATING` immediately, drive partition assignment via a background coordinator, poll for `ACTIVE`. Blocking on 10+ partition initializations in a synchronous HTTP call is a reliability risk.
- Use virtual nodes (V=150 vnodes per physical node) for partition assignment. Even distribution without hotspots, and adding a new node moves only 1/N of total data with load spread across the whole cluster.
- SWIM gossip detects node failures in O(log N) rounds, about 7 seconds at 100 nodes. Pair it with a Raft group to ensure authoritative NodeFailed decisions are made exactly once, not by racing coordinators.
- Live shard splits use a copy-then-atomic-switch pattern: copy data in the background, stream WAL deltas, pause writes for under 500ms, commit an atomic routing table update via Raft, resume.
- The sub-500ms write pause is the only visible impact of a shard split. Total split cost is under half a second of write stall, not minutes of downtime.
- Never perform AuthZ on the hot data plane path via a control plane call. Issue short-lived JWTs (1-hour TTL) at the control plane; verify locally at the data node using the CP public key.
- JWT revocation before expiry: maintain a small Redis blocklist of revoked `jti` claims with TTL matching remaining lifetime. The blocklist is orders of magnitude smaller than total active tokens.
- Access policies are per-table, per-principal, and stored in the Metadata Store. Changes propagate lazily via JWT TTL expiry, not immediately via cache invalidation.
- The Metadata Store is the single load-bearing wall. No control plane component calls another directly. All state is mediated through the Metadata Store, making components independently deployable and scalable.
- Monitoring includes: data plane ops/sec per partition (hot shard detection), replication lag per partition (recovery completeness), control plane metadata write latency (Raft health), and node heartbeat staleness (failure detection SLA).