Key-Value Store
Design a distributed key-value store like DynamoDB or Cassandra: from a single-node hash map to a consistent-hashing ring with replication, quorum reads, and tunable consistency.
What is a key-value store?
A key-value store maps arbitrary keys to arbitrary values with three operations: put, get, and delete. The apparent simplicity is deceptive. Making those three operations work reliably across hundreds of nodes, surviving node crashes, and still returning results in under 10ms is genuinely hard.
I like this question because it strips away application-layer complexity and forces you straight into distributed systems fundamentals. This question tests consistent hashing, replication theory, quorum reads and writes, LSM tree internals, and every tricky corner of the CAP theorem simultaneously.
Functional Requirements
Core Requirements
- put(key, value): write or overwrite a value for a key.
- get(key): retrieve the value for a key.
- delete(key): remove a key-value pair.
- Horizontal scaling across many nodes with no downtime for node additions or removals.
Below the Line (out of scope)
- Range scans and secondary indexes
- Full SQL semantics
- Cross-key transactions (ACID)
The hardest part in scope: Distributing data consistently across N nodes while surviving failures is the single hardest problem here. Everything else in this article is in service of that one constraint.
Range scans and secondary indexes are below the line because they require sorted on-disk structures or separate index tables. To add them, layer a sorted SSTable-based scan path on top of the LSM tree and build secondary indexes as separate key spaces that map index values back to primary keys.
Full SQL semantics are below the line because they require a query planner, JOIN support, and transactions spanning multiple keys. An existing relational engine like CockroachDB or YugabyteDB is better suited here than a custom key-value layer.
Cross-key ACID transactions are below the line because they require distributed locking or a multi-version concurrency control (MVCC) layer. To add limited transaction support, use optimistic locking: read a set of keys with their vector clock versions, write all or none with a conditional check that aborts if any version changed.
Non-Functional Requirements
Core Requirements
- Availability: 99.99% uptime. The store favors availability over consistency (eventual consistency by default, tunable to strong).
- Latency: Sub-10ms p99 for both get and put.
- Scale: 10 TB total data, 100K writes/sec, 1M reads/sec at peak.
- Durability: Data survives individual node failures and restarts.
Below the Line
- Cross-region active-active replication
- Point-in-time recovery from before an application bug
- Per-key TTL-based eviction (can be layered on top)
Read/write ratio: Reads outpace writes 10:1 at peak. That ratio drives two decisions: use LSM trees on each node (optimized for write throughput, sequential I/O) and place a tiered read cache above the node tier. Almost every design decision in this article traces back to surviving 1M reads/sec without starving the 100K writes/sec.
I treat the sub-10ms p99 latency as the forcing constraint. Any approach that adds a synchronous network hop per read or forces random disk seeks is architecturally wrong for this system.
Core Entities
- KeyValueEntry: The stored record containing the key, value bytes, version (vector clock or timestamp), and an optional TTL.
- Node: A physical or virtual machine in the cluster, owning a slice of the key space and running an LSM storage engine locally.
- VNode (Virtual Node): A logical partition token assigned to a physical node. Each physical node owns multiple vnodes, distributing load more evenly across the ring.
- ReplicationGroup: The set of N nodes (typically 3) responsible for a given key, selected by walking clockwise on the consistent hashing ring from the key's hash position.
The full schema for KeyValueEntry includes the vector clock and tombstone flag for deletes. I will revisit schema details in the conflict resolution deep dive.
API Design
Start with one endpoint per functional requirement, then note where routing transparency matters.
FR 1 - put:
PUT /keys/{key}
Body: { "value": "<bytes>", "ttl_seconds": 3600 }
Response: 200 OK
FR 2 - get:
GET /keys/{key}
Response: { "key": "mykey", "value": "<bytes>", "version": "1:3,2:1" }
FR 3 - delete:
DELETE /keys/{key}
Response: 204 No Content
Use PUT for writes, not POST. PUT has idempotent upsert semantics: repeated calls with the same key and value are safe and produce the same result. POST implies creating a new resource each time, which is wrong for a key-value store where the key is the identity.
The version field in the get response is a serialized vector clock. The client needs it to perform conditional updates and for the store to detect concurrent write conflicts. Without a version, two clients retrieving the same key simultaneously have no way to know their writes are causally concurrent.
Routing is transparent to the client. A coordinator node (or a smart client library, in the Dynamo style) resolves which physical nodes own the key and fans out requests. Clients always talk to the same coordinator endpoint, not directly to storage nodes. Routing via consistent hashing is covered in the deep dives.
High-Level Design
1. Single-node store
The simplest store that satisfies the put/get/delete requirements: a single server with an in-memory hash map, backed by an append-only write-ahead log (WAL) for durability.
Components:
- Client: Sends HTTP requests for put/get/delete.
- Server: Holds the in-memory hash map. On every write, appends the operation to the WAL before acknowledging.
- WAL (Write-Ahead Log): An append-only log on disk. On crash recovery, the server replays the WAL to rebuild the in-memory state.
- Disk: Stores the WAL file. The hash map itself lives in RAM.
Request walkthrough (put):
- Client sends PUT /keys/session:abc123 with the value.
- Server appends {op: SET, key, value, timestamp} to the WAL on disk.
- Server updates the in-memory hash map: map["session:abc123"] = value.
- Server returns 200 OK.
Request walkthrough (get):
- Client sends GET /keys/session:abc123.
- Server reads from the in-memory hash map: O(1) lookup.
- Server returns the value.
This works perfectly for a small dataset. Two things break at scale: the entire dataset must fit in RAM, and one node dying takes the whole store offline. Both are solved by distributing data across nodes.
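The whole single-node design fits in a short sketch. This is illustrative, not production code: it assumes JSON-serializable values, and `wal.log` is a hypothetical default path.

```python
import json
import os

class SingleNodeStore:
    """In-memory hash map backed by an append-only write-ahead log."""

    def __init__(self, wal_path="wal.log"):
        self.wal_path = wal_path
        self.data = {}
        self._replay()  # rebuild in-memory state after a crash or restart

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                entry = json.loads(line)
                if entry["op"] == "SET":
                    self.data[entry["key"]] = entry["value"]
                else:  # DELETE
                    self.data.pop(entry["key"], None)

    def _append_wal(self, entry):
        # Append and fsync before acknowledging, so the write survives a crash.
        with open(self.wal_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def put(self, key, value):
        self._append_wal({"op": "SET", "key": key, "value": value})
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)  # O(1) in-memory lookup

    def delete(self, key):
        self._append_wal({"op": "DELETE", "key": key})
        self.data.pop(key, None)
```

Restarting the process and replaying the WAL reconstructs exactly the state that was acknowledged before the crash, which is the durability guarantee this section relies on.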
I always start with the single-node design on the whiteboard before jumping to distribution. Interviewers want to see that you can reason about the simplest version first and then articulate exactly what breaks before reaching for distributed infrastructure.
2. Partitioning data across nodes (consistent hashing)
One server can hold maybe a few hundred GB in RAM before cost becomes untenable. With 10 TB of data, you need to spread keys across many nodes.
The naive approach: modulo hashing. Assign a key to node i using node_id = hash(key) % N. This is simple and fast. The problem surfaces the moment you add or remove a node. When N changes from 10 to 11, every key whose hash maps to a different bucket must move to a new node. On average that is N/(N+1) of all keys; adding one node to a 10-node cluster triggers migration of roughly 91% of all keys. That is catastrophic for a production system.
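A quick simulation makes the remapping cost concrete. This sketch uses MD5 purely as a stable hash function; the key names are arbitrary.

```python
import hashlib

def bucket(key, n):
    # Stable hash so the result is reproducible across runs and processes.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"key-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if bucket(k, 10) != bucket(k, 11))
print(f"{moved / len(keys):.0%} of keys move when N goes 10 -> 11")
```

Running this shows roughly nine out of ten keys landing in a different bucket, which is exactly the migration storm consistent hashing avoids.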
The fix: consistent hashing. Treat the key space as a ring from 0 to 2^32 - 1. Each node is assigned a position on the ring by hashing a node identifier. A key maps to the first node clockwise from hash(key) on the ring. When a node is added, it only takes keys from its immediate clockwise neighbor. When a node is removed, its keys shift to the next node clockwise. In both cases, only ~1/N keys move.
One token per physical machine produces uneven arc sizes: a node that gets unlucky with its ring position ends up owning 30% of the key space while another owns 5%. Virtual nodes (vnodes) fix this. Each physical node owns K tokens spread around the ring (K=150 is typical). The arc sizes average out, and load distribution approaches uniform.
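A minimal ring sketch, using Python's `bisect` over a sorted token list in place of a TreeMap. MD5 is used purely as a stable hash, and the 150-vnode default matches the typical figure above.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing ring with virtual nodes."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.tokens = []   # sorted token positions on the ring
        self.owners = {}   # token -> physical node id

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

    def add_node(self, node_id):
        for i in range(self.vnodes):
            token = self._hash(f"{node_id}#vnode-{i}")
            bisect.insort(self.tokens, token)
            self.owners[token] = node_id

    def remove_node(self, node_id):
        self.tokens = [t for t in self.tokens if self.owners[t] != node_id]
        self.owners = {t: o for t, o in self.owners.items() if o != node_id}

    def lookup(self, key):
        # First token clockwise from hash(key), wrapping around the ring.
        token = self._hash(key)
        i = bisect.bisect_right(self.tokens, token) % len(self.tokens)
        return self.owners[self.tokens[i]]
```

Note that `remove_node` only deletes that node's tokens: every key owned by a surviving node keeps its owner, which is the ~1/N migration property in code.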
Components:
- Consistent hashing ring: A sorted data structure (e.g., a TreeMap) mapping token positions to node IDs.
- Node A, B, C...: Storage nodes, each owning multiple vnode arcs on the ring.
- Coordinator: Accepts requests from clients, looks up the ring to find the responsible node(s), and routes accordingly.
Request walkthrough (put):
- Client sends PUT /keys/user:42 to the coordinator.
- Coordinator computes token = hash("user:42").
- Coordinator binary-searches the ring for the first token position >= the key's hash.
- Coordinator routes the request to the owning node.
- Owning node stores the key-value pair and returns 200 OK.
Consistent hashing with vnodes is the partitioning strategy used by DynamoDB, Cassandra, and Riak. When an interviewer asks how you distribute data, this is the answer.
I find that drawing the ring with three or four labeled tokens and walking through one key lookup is worth more than five minutes of verbal explanation. Interviewers remember the visual.
3. Replication and failure handling
One replica means a node crash loses all data on that node. With 100 nodes in a cluster, you expect a node failure roughly every few days. Single replicas are not acceptable.
The naive approach: write only to the primary. A single primary per partition crashes and the key is unreachable until recovery. Recovery from a fresh replica resync can take minutes to hours depending on data size.
The fix: replication factor N=3. The coordinator sends each put to the primary node and two replica nodes. The three nodes for a given key are the first three clockwise nodes on the ring from the key's hash position. With N=3, the cluster tolerates one node failure without data loss.
Nodes detect each other's failures using a gossip protocol. Every node periodically selects K random peers and exchanges its current view of which nodes are alive. Failure information propagates in O(log N) gossip rounds, reaching all N nodes in seconds with no central coordinator.
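The O(log N) propagation claim can be checked with a toy simulation. This is not a real gossip protocol implementation; the fanout and seed are illustrative, and "informed" stands in for "has heard that node 0 is down."

```python
import random

def gossip_rounds(n_nodes, fanout=3, seed=1):
    """Rounds until news of a failure reaches every node via random gossip."""
    random.seed(seed)
    informed = {0}  # node 0's immediate peers have just marked it suspect
    rounds = 0
    while len(informed) < n_nodes and rounds < 50:
        rounds += 1
        for node in list(informed):
            # Each informed node tells `fanout` random peers this round.
            informed.update(random.sample(range(n_nodes), fanout))
    return rounds

print(gossip_rounds(100))   # a handful of rounds for a 100-node cluster
print(gossip_rounds(1000))  # only slightly more for 10x the nodes
```

The informed set grows multiplicatively each round, so the round count scales with log N rather than N, which is why gossip stays cheap as clusters grow.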
Quorum writes and reads: Writing to all three replicas synchronously adds latency. Instead, use quorum writes: the coordinator considers a write successful when W replicas acknowledge it (W=2 by default). Quorum reads contact R replicas and return the value with the most recent version (R=2 by default). With N=3, W=2, R=2, at least one replica in any read quorum must have seen the latest write.
Components:
- Coordinator: Fans writes out to N=3 nodes; collects acknowledgments; returns success when W=2 respond.
- Primary Node + Replica 1 + Replica 2: Each stores the full key-value entry and its version.
- Gossip daemon: Running on every node, broadcasting node liveness within seconds.
Request walkthrough (put with replication):
- Client sends PUT /keys/cart:user99 to the coordinator.
- Coordinator identifies the three ring nodes responsible for the key.
- Coordinator sends the write in parallel to all three nodes.
- Two nodes (W=2) acknowledge. Coordinator returns 200 OK to the client.
- The third replica receives the write asynchronously and applies it.
The gossip protocol is the failure detector. When a node goes silent for more than a configurable heartbeat window (typically 5 seconds), its peers mark it as suspect and route around it. No single point of coordination required.
I'd mention gossip proactively when the interviewer asks about node failure. A centralized heartbeat monitor is the obvious first answer, but it introduces a single point of failure for your failure detector, which is ironic. Gossip eliminates that and scales to thousands of nodes.
4. Tunable consistency
Not every operation needs the same consistency guarantee. A shopping cart can tolerate brief inconsistency; a money balance probably cannot. Hardcoding a single consistency level forces you to either accept unsafe reads in critical paths or pay latency penalties everywhere.
The system exposes tunable consistency through the R and W quorum parameters. When R + W > N, the read quorum and write quorum must overlap by at least one replica. That overlap forces at least one replica in any read to have seen the latest write, giving strong consistency. When R + W ≤ N, reads can miss the latest write, giving eventual consistency.
With N=3, the common configurations are:
| Configuration | R | W | R+W | Consistency | Use case |
|---|---|---|---|---|---|
| High availability | 1 | 1 | 2 | Eventual | Session data, shopping carts |
| Balanced (default) | 2 | 2 | 4 | Strong | User profiles, inventory counts |
| Write-optimized | 1 | 3 | 4 | Strong | Audit logs, append-heavy |
| Read-optimized | 3 | 1 | 4 | Strong | Config reads, rarely updated |
The quorum rule in one sentence
R + W > N guarantees strong consistency because every read quorum and write quorum share at least one node, ensuring the latest write is always visible to reads.
For the default configuration, use N=3, W=2, R=2. This survives one node failure on both the read and write path and still provides strong consistency. Reserve R=1, W=1 only for data where brief staleness is explicitly acceptable.
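The table above reduces to one inequality. A tiny sketch that evaluates each configuration:

```python
def consistency(n, r, w):
    """R + W > N: every read quorum overlaps every write quorum in >= 1 replica."""
    return "strong" if r + w > n else "eventual"

# The four table configurations, all with N=3 (listed as R, W):
configs = {
    "high availability": (1, 1),
    "balanced (default)": (2, 2),
    "write-optimized": (1, 3),
    "read-optimized": (3, 1),
}
for name, (r, w) in configs.items():
    print(f"{name}: R={r}, W={w} -> {consistency(3, r, w)}")
```

Only the R=1, W=1 configuration fails the inequality, which is why it is reserved for data where brief staleness is acceptable.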
I always write the quorum formula on the whiteboard and show the math for one specific example (N=3, W=2, R=2 means at least one node in the read set saw the latest write). Interviewers probe this, and having the arithmetic visible makes the answer bulletproof.
Potential Deep Dives
1. How do we store data efficiently on each node?
Three constraints drive the storage engine choice. Write throughput is 100K writes/sec distributed across nodes (roughly 1,000-3,000 writes/sec per node at scale). Data must survive node restarts (no purely in-memory structures). Compaction must not stall the write path.
The right storage engine for this system is an LSM tree. Write throughput is the primary constraint (100K writes/sec), and LSM trees are explicitly designed to make every write a sequential disk operation.
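A toy sketch of the LSM write and read paths. A real engine keeps the memtable in a sorted tree, writes SSTables to disk behind a WAL, and adds Bloom filters; here both structures are in-memory dicts and the flush threshold is tiny for illustration.

```python
class LSMNode:
    """Toy LSM engine: in-memory memtable, flushed to immutable SSTables."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # a sorted tree in a real engine
        self.sstables = []       # newest first; each is an immutable snapshot
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real engine appends to the WAL here before touching the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One sequential write of an immutable, sorted SSTable; no random I/O.
        sstable = dict(sorted(self.memtable.items()))
        self.sstables.insert(0, sstable)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:  # newest to oldest: read amplification
            if key in sstable:
                return sstable[key]
        return None
```

The newest-to-oldest scan in `get` is the read amplification the cheat sheet warns about, and it is exactly the cost that per-SSTable Bloom filters exist to cut.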
2. How do we route requests to the right node?
Three constraints drive the routing strategy. No single routing bottleneck (the coordinator cannot be the only entity that knows the ring). O(log N) key lookup per request. Minimal key redistribution when nodes join or leave.
Consistent hashing with vnodes is the routing strategy. The full ring state for a 100-node cluster with 150 tokens each fits in a few hundred KB of memory, making ring lookups local and fast on every coordinator.
3. How do we handle write conflicts and ensure durability?
Two constraints apply. Concurrent puts to the same key from two different clients must not silently lose data. All writes must survive a single node failure without a coordinator round-trip on recovery.
Vector clocks expose concurrent writes rather than silently discarding them. For a production key-value store handling mutable shared state, silent data loss is worse than the added client complexity of reconciliation.
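The detection logic is a pairwise comparison of clocks. A sketch, representing a vector clock as a dict mapping node id to counter:

```python
def compare(vc_a, vc_b):
    """Order two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # vc_a is causally older; safe to discard
    if b_le_a:
        return "after"
    return "concurrent"      # neither dominates: keep both as siblings
```

Only the "concurrent" case forces sibling values back to the client for reconciliation; the ordered cases resolve automatically, which is why vector clocks lose no acknowledged write.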
I would frame the LWW vs vector clock choice as a product decision on the whiteboard: "LWW is the right default for 80% of use cases (counters, session data, sensor logs). Vector clocks matter when two users can modify the same key concurrently and both writes need to survive." That framing shows you understand the tradeoff is context-dependent, not absolute.
Final Architecture
After all three deep dives, the complete system looks like this:
Every design decision in this architecture is a direct response to a specific NFR: LSM trees answer the 100K writes/sec requirement, consistent hashing with vnodes answers the 10 TB horizontal scaling requirement, quorum replication answers the 99.99% availability requirement, and vector clocks answer the correctness requirement for concurrent writes.
Interview Cheat Sheet
- Always open with consistent hashing, not modulo. Modulo hashing (hash(key) % N) remaps approximately 90% of keys when N changes. Consistent hashing moves only ~1/N keys when one node is added or removed.
- Virtual nodes fix the uneven arc problem. One token per node produces exponentially distributed arc sizes. With 150 vnodes per physical node, arc size variance drops to roughly 8%, giving near-uniform load distribution across the cluster.
- State the quorum math as a formula. R + W > N guarantees strong consistency by ensuring every read quorum overlaps the latest write quorum by at least one replica. Default to N=3, W=2, R=2.
- LSM tree write path is always sequential. Writes go: WAL (append-only) then MemTable (in-memory sorted tree). When MemTable reaches ~64 MB, flush to an immutable SSTable via sequential disk write. No random I/O on the write path whatsoever.
- Read amplification is the cost of LSM trees. B-trees have lower read amplification (one O(log N) tree traversal per lookup). LSM trees may check multiple SSTables for a single key. Bloom filters bring this to near-zero for absent keys, but present keys in older levels still require disk seeks.
- Bloom filters are not optional in an LSM engine. Without them, every read miss touches every SSTable level on disk. One Bloom filter per SSTable, tuned to ~1% false positive rate, eliminates most unnecessary disk seeks.
- Compaction runs in the background and never blocks writes. Size-tiered compaction merges SSTables of similar size. Leveled compaction (used by RocksDB and Cassandra) reduces read amplification by keeping each level to a bounded total size. Both approaches delete tombstones and older versions of overwritten keys.
- Gossip for failure detection, not a central heartbeat master. Every node gossips to K random peers every second. A node silent for 5 consecutive windows is marked suspect. Information propagates to all N nodes in O(log N) gossip rounds, roughly 3-5 seconds in a 100-node cluster.
- Vector clocks surface concurrent writes; last-write-wins silently drops them. For mutable shared state (shopping carts, collaborative docs), use vector clocks and return siblings for client reconciliation. For immutable or monotonically increasing data (sensor readings, counters), last-write-wins keyed on a hybrid logical clock (HLC) timestamp is simpler and sufficient.
- Hot key mitigation uses key suffix sharding. A key receiving 100x average traffic routes all that load to one node. Append a random suffix 0-9 to the key on write, spreading traffic across 10 nodes. On read, issue 10 parallel reads and merge. Pair with request hedging (send to a second replica if the first does not respond in 5ms) to cut p99 latency.
- Tunable consistency is the product differentiator. Operators choose R and W per workload: R=1, W=1 for maximum availability on session data; R=2, W=2 for strong consistency on inventory. The system enforces the math; the operator owns the tradeoff.
- Durability is not just replication. The WAL on each node ensures a write that crashes before flushing to SSTable is recovered on restart. Replication handles node failures; the WAL handles process crashes. Both are required.
- When the interviewer asks about node failure recovery: the gossip layer detects it in ~5 seconds, the coordinator routes requests to the remaining live nodes in the replication group, and hinted handoff queues writes destined for the failed node on the next-clockwise healthy node for replay once it recovers.
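The hot-key suffix sharding from the cheat sheet can be sketched as follows. Reads are shown sequentially for clarity (a real coordinator issues the 10 reads in parallel), and the merge is simple list concatenation under the assumption of append-style values; `SHARDS` and the `#` separator are illustrative choices.

```python
import random

SHARDS = 10  # suffixes 0-9, spreading one hot key across up to 10 nodes

def sharded_put(store, key, value):
    """Write under a random suffix shard so no single node takes all the load."""
    shard_key = f"{key}#{random.randrange(SHARDS)}"
    store.setdefault(shard_key, []).append(value)

def sharded_read(store, key):
    """Read every shard and merge (issued in parallel by a real coordinator)."""
    merged = []
    for i in range(SHARDS):
        merged.extend(store.get(f"{key}#{i}", []))
    return merged
```

The write path gets 10x fan-out of load for free; the read path pays for it with 10 lookups per request, which is why this technique is reserved for genuinely hot keys.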