Database Control Plane
Walk through the full control plane design of a distributed database like DynamoDB or CockroachDB: from table provisioning to live shard splitting to failure-driven partition recovery at 100 nodes and 10,000 shards.
What is a database control plane?
A distributed database is two systems stacked on top of each other: the data plane handles reads and writes, while the control plane decides where data lives, who can access it, and what to do when nodes fail or capacity changes. The engineering challenge is that you are designing a distributed system whose job is to manage another distributed system. Every metadata decision must be strongly consistent, because a stale routing entry silently directs a write to the wrong shard.
This question tests control-plane architecture, distributed metadata management, failure detection protocols, live data rebalancing, and the CAP tradeoff in a context where AP-style inconsistency in metadata is not acceptable.
Functional Requirements
Core Requirements
- Create, configure, and delete tables: schema definition, partition key selection, and capacity provisioning.
- Monitor all data plane nodes and automatically recover from node failures via partition reassignment.
- Scale capacity up or down by adding nodes, splitting shards, and rebalancing partitions without downtime.
- Authenticate database clients and enforce per-table access control policies.
Below the Line (out of scope)
- Data plane internals (storage engine, WAL management, compaction, query execution)
- Point-in-time restore and application-level backup restore workflows
- Cross-cluster federation and global table multi-region replication
- Query planner and cost-based optimization
The hardest part in scope: Live shard splitting without downtime. The control plane must coordinate a multi-step partition migration while the data plane continues serving read and write traffic on the exact shards being split. We will dedicate a full deep dive to this problem.
Data plane internals are below the line because the control plane and data plane communicate over a narrow API: the control plane tells data nodes which partitions they own and data nodes report health metrics back. What the data node does with its partitions internally is a separate bounded context.
Point-in-time restore is below the line because it does not change the control plane topology logic. To add it, I would periodically snapshot partition data to object storage and add a RestoreTable API that provisions a new table with data rolled back from a chosen snapshot plus WAL replay.
Cross-cluster federation is too large for a single interview. To add it, I would introduce a global routing tier with a cross-region control plane that uses CRDT-based schema convergence and routes writes to the owning region.
Query planner optimization is below the line because it lives inside the data plane's execution layer, not in the control plane's routing or topology logic. To add it, I would extend the control plane's schema registry to include column statistics (cardinality, null rate, value histograms) and expose them to a dedicated query planner service that constructs execution plans before handing off to the data plane.
Non-Functional Requirements
Core Requirements
- Consistency: Metadata (routing table, schema, topology) must be linearizable (RPO = 0 missed writes; stale routing reads are not acceptable for even 1 second). A stale routing entry silently directs a write to the wrong shard with no error surfaced to the application.
- Availability: 99.99% uptime for the control plane service. A control plane outage means no new tables can be created and node failures cannot be automatically recovered.
- Provisioning latency: Table creation completes within 10 seconds for tables with up to 10 partitions. Clients receive a CREATING status immediately and poll for ACTIVE.
- Failure detection: Node failures are detected within 30 seconds. Partition reassignment (recovery) completes within 120 seconds of detection.
- Scale: Support 1,000 tables per cluster, 100 data plane nodes per cluster, and 10,000 partitions per cluster.
Below the Line
- Sub-second metadata propagation across all nodes (5-second propagation delay is acceptable)
- Multi-region control plane replication (single-region in this design)
Read/write ratio: The control plane sees two entirely different traffic patterns. Provisioning operations (CreateTable, ModifyCapacity, UpdatePolicy) are rare, maybe 1,000 per day cluster-wide. But routing lookups happen on every single data plane request. At 100 nodes processing 10,000 requests per second each, that is 1,000,000 routing lookups per second against the metadata store. The design tension is sharp: writes must be strongly consistent to prevent misroutes; reads must be served from a fast local cache to avoid making the metadata store a bottleneck. I'd call this the central architectural constraint of the whole system.
Metadata must be strongly consistent because a stale routing entry is a write-to-wrong-shard error that is invisible to the application. The 99.99% availability target means the control plane itself must run with Raft-based redundancy, not a single process.
Core Entities
- Table: A named logical table with a schema, partition key definition, and current status (CREATING, ACTIVE, DELETING, SCALING). The table record is the anchor for all provisioning operations.
- Partition: A shard of a table covering a contiguous key range. Carries the key range boundaries, the primary node assignment, replica node assignments, a version counter, and current status (HEALTHY, SPLITTING, MIGRATING).
- Node: A physical data plane server. Carries its endpoint address, health status (HEALTHY, SUSPECTED, FAILED), last heartbeat timestamp, and the list of partition IDs it owns.
- AccessPolicy: A binding of a principal (IAM role or user ARN) to a set of allowed operations (read, write, admin) on a specific table with an allow or deny effect.
- SchemaVersion: An immutable snapshot of a table's column definitions at a given version number. Used for schema evolution and compatibility validation when modifying a live table.
- Credential: A short-lived signed token issued by the control plane that embeds the principal's allowed actions on a specific table with an expiry timestamp. Verified by data nodes without a round-trip to the control plane.
Key fields are shown below for the most important entities; full index and storage optimization details are deferred to a schema deep dive if pursued in the interview.
| Entity | Key Fields |
|---|---|
| Table | table_id (PK), table_name, partition_key, sort_key?, status (CREATING/ACTIVE/DELETING/SCALING), capacity_mode, read_units, write_units, schema_version_id, created_at |
| Partition | partition_id (PK), table_id (FK), key_range_low, key_range_high, primary_node_id, replica_node_ids[], version, status (HEALTHY/SPLITTING/MIGRATING) |
| Node | node_id (PK), endpoint, status (HEALTHY/SUSPECTED/FAILED), last_heartbeat_at, partition_ids[] |
API Design
FR 1 - Create and manage a table:
POST /tables
Body: {
table_name, partition_key, sort_key?,
capacity: { mode: "PROVISIONED"|"ON_DEMAND", read_units?, write_units? },
schema: { attributes: [{ name, type }] }
}
Response: { table_arn, status: "CREATING" }
We return CREATING immediately rather than blocking until ACTIVE. Partition assignment and data node initialization takes several seconds. Clients poll GET /tables/{name} for status. This async pattern mirrors how AWS CloudFormation and DynamoDB handle provisioning.
GET /tables/{table_name}
Response: { table_arn, status, partition_count, capacity, created_at }
DELETE /tables/{table_name}
Response: HTTP 202 Accepted
DELETE returns 202 rather than 204 because partition deallocation and data cleanup are async operations. A 204 would imply the deletion is complete.
FR 2 - Inspect cluster health (operator-facing):
GET /clusters/{cluster_id}/nodes
Response: {
nodes: [{ node_id, endpoint, status, partitions_owned,
metrics: { cpu_pct, disk_pct, ops_per_sec, replication_lag_ms } }]
}
FR 3 - Modify table capacity:
PUT /tables/{table_name}/capacity
Body: { mode: "PROVISIONED", read_units: 5000, write_units: 1000 }
Response: { status: "SCALING", estimated_completion_seconds: 45 }
Capacity changes are also async. A SCALING status means the Partition Assigner may be splitting or merging shards in the background. The client polls the table status until it returns to ACTIVE.
FR 4 - Manage access policies and issue credentials:
PUT /tables/{table_name}/access-policies/{principal_arn}
Body: { actions: ["read", "write"], effect: "allow" }
Response: { policy_version }
POST /tables/{table_name}/credentials
Body: { principal_arn, ttl_seconds: 3600 }
Response: { token, expires_at }
POST /credentials issues short-lived signed tokens, not long-lived API keys. A leaked long-lived key gives an attacker indefinite access. A 1-hour token limits blast radius to a narrow window and rotates automatically without any client credential management.
High-Level Design
1. Creating and configuring a table
The write path for table provisioning is a multi-step workflow, not a single synchronous call.
The naive approach is a single HTTP endpoint that registers a schema and returns 200. The problem is that table creation actually involves partition math (how many shards for this capacity?), node selection (which healthy nodes should own them?), and data node initialization (allocate storage, notify the node). None of that fits in a synchronous HTTP response. The correct pattern is: write intent to Metadata Store, return CREATING, let a background coordinator drive the workflow to completion.
Components:
- CP API Service: Stateless HTTP service. Validates the CreateTable request, writes the table record with status
CREATINGto the Metadata Store, and returns immediately. - Metadata Store: A Raft-based strongly consistent KV store (like etcd). The single source of truth for the routing table, schema registry, and topology. Every control plane decision writes here first.
- Partition Assigner: Reads pending CreateTable jobs. Computes the initial partition count from the requested capacity. Assigns each partition to a healthy data node. Writes the partition map to the Metadata Store.
- Data Plane Nodes: Receive partition ownership notifications. Allocate storage for the new partition. Reply to the CP API with an ACK.
Request walkthrough:
- Client sends
POST /tableswith schema and capacity parameters. - CP API validates the request, writes
Table(status=CREATING)and a CreateTable job to the Metadata Store. - CP API returns
{ status: "CREATING" }to the client immediately. - Partition Assigner reads the pending job. For a small table (100 WCU), it assigns 1 partition. For a large table (100,000 WCU), it may assign 100+ partitions.
- Partition Assigner writes the partition map (partition_id, key_range, node_id) to the Metadata Store.
- CP API notifies each assigned data node of its new partition assignments via a push notification.
- Data nodes allocate storage and ACK. All ACKs received: CP API updates the table status to
ACTIVE. - Client polling
GET /tables/{name}eventually seesACTIVE.
This covers the write path only. The data plane request path (how clients route reads and writes to the correct node) is handled by the routing table in the Metadata Store, which data plane clients cache locally.
2. Detecting node failures and triggering recovery
The control plane must detect failures independently of the data plane, because a failed node cannot report itself as failed.
Each data plane node sends a heartbeat to the Health Monitor every 10 seconds. If no heartbeat arrives for 30 seconds, the node is marked SUSPECTED. If still absent at 60 seconds, it is marked FAILED. The two-stage window prevents false positives from transient network hiccups. A node that was SUSPECTED and recovers before 60 seconds is immediately returned to HEALTHY.
When a node hits FAILED:
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.