Key-Value Store
Design a distributed key-value store like DynamoDB or Cassandra: from a single-node hash map to a consistent-hashing ring with replication, quorum reads, and tunable consistency.
What is a key-value store?
A key-value store maps arbitrary keys to arbitrary values with three operations: put, get, and delete. The apparent simplicity is deceptive. Making those three operations work reliably across hundreds of nodes, surviving node crashes, and still returning results in under 10ms is genuinely hard.
I like this question because it strips away application-layer complexity and forces you straight into distributed systems fundamentals. This question tests consistent hashing, replication theory, quorum reads and writes, LSM tree internals, and every tricky corner of the CAP theorem simultaneously.
Functional Requirements
Core Requirements
put(key, value): write or overwrite a value for a key.get(key): retrieve the value for a key.delete(key): remove a key-value pair.- Horizontal scaling across many nodes with no downtime for node additions or removals.
Below the Line (out of scope)
- Range scans and secondary indexes
- Full SQL semantics
- Cross-key transactions (ACID)
The hardest part in scope: Distributing data consistently across N nodes while surviving failures is the single hardest problem here. Everything else in this article is in service of that one constraint.
Range scans and secondary indexes are below the line because they require sorted on-disk structures or separate index tables. To add them, layer a sorted SSTable-based scan path on top of the LSM tree and build secondary indexes as separate key spaces that map index values back to primary keys.
Full SQL semantics are below the line because they require a query planner, JOIN support, and transactions spanning multiple keys. An existing relational engine like CockroachDB or YugabyteDB is better suited here than a custom key-value layer.
Cross-key ACID transactions are below the line because they require distributed locking or a multi-version concurrency control (MVCC) layer. To add limited transaction support, use optimistic locking: read a set of keys with their vector clock versions, write all or none with a conditional check that aborts if any version changed.
Non-Functional Requirements
Core Requirements
- Availability: 99.99% uptime. The store favors availability over consistency (eventual consistency by default, tunable to strong).
- Latency: Sub-10ms p99 for both
getandput. - Scale: 10 TB total data, 100K writes/sec, 1M reads/sec at peak.
- Durability: Data survives individual node failures and restarts.
Below the Line
- Cross-region active-active replication
- Point-in-time recovery from before an application bug
- Per-key TTL-based eviction (can be layered on top)
Read/write ratio: Reads outpace writes 10:1 at peak. That ratio drives two decisions: use LSM trees on each node (optimized for write throughput, sequential I/O) and place a tiered read cache above the node tier. Almost every design decision in this article traces back to surviving 1M reads/sec without starving the 100K writes/sec.
I treat the sub-10ms p99 latency as the forcing constraint. Any approach that adds a synchronous network hop per read or forces random disk seeks is architecturally wrong for this system.
Core Entities
- KeyValueEntry: The stored record containing the key, value bytes, version (vector clock or timestamp), and an optional TTL.
- Node: A physical or virtual machine in the cluster, owning a slice of the key space and running an LSM storage engine locally.
- VNode (Virtual Node): A logical partition token assigned to a physical node. Each physical node owns multiple vnodes, distributing load more evenly across the ring.
- ReplicationGroup: The set of N nodes (typically 3) responsible for a given key, selected by walking clockwise on the consistent hashing ring from the key's hash position.
The full schema for KeyValueEntry includes the vector clock and tombstone flag for deletes. I will revisit schema details in the conflict resolution deep dive.
API Design
Start with one endpoint per functional requirement, then note where routing transparency matters.
FR 1 - put:
PUT /keys/{key}
Body: { "value": "<bytes>", "ttl_seconds": 3600 }
Response: 200 OK
FR 2 - get:
GET /keys/{key}
Response: { "key": "mykey", "value": "<bytes>", "version": "1:3,2:1" }
FR 3 - delete:
DELETE /keys/{key}
Response: 204 No Content
Use PUT for writes, not POST. PUT has idempotent upsert semantics: repeated calls with the same key and value are safe and produce the same result. POST implies creating a new resource each time, which is wrong for a key-value store where the key is the identity.
The version field in the get response is a serialized vector clock. The client needs it to perform conditional updates and for the store to detect concurrent write conflicts. Without a version, two clients retrieving the same key simultaneously have no way to know their writes are causally concurrent.
Routing is transparent to the client. A coordinator node (or a smart client library, in the Dynamo style) resolves which physical nodes own the key and fans out requests. Clients always talk to the same coordinator endpoint, not directly to storage nodes. Routing via consistent hashing is covered in the deep dives.
High-Level Design
1. Single-node store
The simplest store that satisfies the put/get/delete requirements: a single server with an in-memory hash map, backed by an append-only write-ahead log (WAL) for durability.
Components:
- Client: Sends HTTP requests for put/get/delete.
- Server: Holds the in-memory hash map. On every write, appends the operation to the WAL before acknowledging.
- WAL (Write-Ahead Log): An append-only log on disk. On crash recovery, the server replays the WAL to rebuild the in-memory state.
- Disk: Stores the WAL file. The hash map itself lives in RAM.
Request walkthrough (put):
- Client sends
PUT /keys/session:abc123with the value. - Server appends
{op: SET, key, value, timestamp}to the WAL on disk. - Server updates the in-memory hash map:
map["session:abc123"] = value. - Server returns
200 OK.
Request walkthrough (get):
- Client sends
GET /keys/session:abc123. - Server reads from the in-memory hash map: O(1) lookup.
- Server returns the value.
This works perfectly for a small dataset. Two things break at scale: the entire dataset must fit in RAM, and one node dying takes the whole store offline. Both are solved by distributing data across nodes.
I always start with the single-node design on the whiteboard before jumping to distribution. Interviewers want to see that you can reason about the simplest version first and then articulate exactly what breaks before reaching for distributed infrastructure.
2. Partitioning data across nodes (consistent hashing)
One server can hold maybe a few hundred GB in RAM before cost becomes untenable. With 10 TB of data, you need to spread keys across many nodes.
The naive approach: modulo hashing. Assign a key to node i using node_id = hash(key) % N. This is simple and fast. The problem surfaces the moment you add or remove a node. When N changes from 10 to 11, every key whose hash(key) % N maps to a different bucket must move to a new node. On average that is (N-1)/N of all keys; adding one node to a 10-node cluster triggers migration of roughly 90% of all keys. That is catastrophic for a production system.
The fix: consistent hashing. Treat the key space as a ring from 0 to 2^32. Each node is assigned a position on the ring by hashing a node identifier. A key maps to the first node clockwise from hash(key) on the ring. When a node is added, it only takes keys from its immediate clockwise neighbor. When a node is removed, its keys shift one step clockwise. In both cases, only ~1/N keys move.
One token per physical machine produces uneven arc sizes: a node that gets unlucky with its ring position ends up owning 30% of the key space while another owns 5%. Virtual nodes (vnodes) fix this. Each physical node owns K tokens spread around the ring (K=150 is typical). The arc sizes average out, and load distribution approaches uniform.
Components:
- Consistent hashing ring: A sorted data structure (e.g., a TreeMap) mapping token positions to node IDs.
- Node A, B, C...: Storage nodes, each owning multiple vnode arcs on the ring.
- Coordinator: Accepts requests from clients, looks up the ring to find the responsible node(s), and routes accordingly.
Request walkthrough (put):
- Client sends
PUT /keys/user:42to the coordinator. - Coordinator computes
token = hash("user:42"). - Coordinator binary-searches the ring for the first token position >= the key's hash.
- Coordinator routes the request to the owning node.
- Owning node stores the key-value pair and returns
200 OK.
Consistent hashing with vnodes is the partitioning strategy used by DynamoDB, Cassandra, and Riak. When an interviewer asks how you distribute data, this is the answer.
I find that drawing the ring with three or four labeled tokens and walking through one key lookup is worth more than five minutes of verbal explanation. Interviewers remember the visual.
3. Replication and failure handling
One replica means a node crash loses all data on that node. With 100 nodes in a cluster, you expect a node failure roughly every few days. Single replicas are not acceptable.
The naive approach: write only to the primary. A single primary per partition crashes and the key is unreachable until recovery. Recovery from a fresh replica resync can take minutes to hours depending on data size.
The fix: replication factor N=3. The coordinator sends each put to the primary node and two replica nodes. The three nodes for a given key are the first three clockwise nodes on the ring from the key's hash position. With N=3, the cluster tolerates one node failure without data loss.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.