Consistent hashing
Learn how consistent hashing distributes data across nodes so that adding or removing a node rebalances only 1/N of the keyspace, and why virtual nodes fix the uneven-load problem.
The problem with modulo hashing
The obvious way to distribute keys across N nodes is `node = hash(key) % N`. Simple. Works fine when your fleet is static.
Add one node. N becomes N+1. Now `hash(key) % (N+1)` produces a different result for almost every key, so nearly every key maps to a different node than before. In a 10-node cache cluster, adding an 11th node invalidates ~91% of cached entries at once. Every cache miss goes to your database: a routine 3 a.m. scale-out has just triggered a thundering herd.
Remove one node. Same problem in reverse. The redistribution is total and immediate.
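The ~91% figure is easy to verify empirically. A quick sketch (the key format and the use of md5 are arbitrary illustrative choices):

```python
# Count keys whose modulo bucket changes when a 10-node fleet grows to 11.
import hashlib

def bucket(key: str, n: int) -> int:
    # md5 gives a stable hash; Python's built-in hash() is salted per process
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if bucket(k, 10) != bucket(k, 11))
print(f"{moved / len(keys):.0%} of keys changed buckets")  # ~91%
```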
This is why you need consistent hashing.
How consistent hashing works
Consistent hashing maps both keys and nodes onto the same abstract ring. The ring runs from 0 to 2^32 - 1 (or some large number). Every node is assigned a position on the ring by hashing its identifier (hostname, IP, node ID). Every key is also hashed to a position on the ring.
Lookup rule: For a given key, walk clockwise around the ring until you hit the first node. That node owns the key.
Adding a node: Place the new node on the ring. Only keys between the new node and its predecessor on the ring need to move to the new node. All other keys stay put.
Removing a node: Keys on the removed node move to its successor. All other keys are unaffected.
With N nodes, adding or removing one node moves only about 1/N of the keyspace on average. That property is what makes consistent hashing valuable.
```python
# Consistent hashing lookup. A sorted list of (position, node_id)
# pairs stands in for the tree map a real implementation might use.
import bisect
import hashlib

RING = 2**32
ring = []  # sorted (position, node_id) pairs

def h32(s: str) -> int:
    # Stable uniform hash; md5 is an arbitrary choice here
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

def add_node(node_id: str) -> None:
    bisect.insort(ring, (h32(node_id), node_id))

def get_node(key: str) -> str:
    pos = h32(key)
    # Find the first node at or after pos (clockwise)
    i = bisect.bisect_left(ring, (pos, ""))
    if i == len(ring):
        i = 0  # wrap around to the first position
    return ring[i][1]
```
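A self-contained check of the redistribution property: build a ring of 10 nodes, add an 11th, and count how many of 50,000 keys change owners. Node and key names are made up, and with one position per node the exact fraction depends on where the new node's hash lands, but it stays far below the ~91% that modulo rehashing causes.

```python
import bisect
import hashlib

RING = 2**32

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

def build_ring(node_ids):
    return sorted((h32(n), n) for n in node_ids)

def owner(ring, key: str) -> str:
    # First node clockwise from the key's position
    i = bisect.bisect_left(ring, (h32(key), ""))
    return ring[i % len(ring)][1]  # wrap past the last position

nodes = [f"node-{i}" for i in range(10)]
before = build_ring(nodes)
after = build_ring(nodes + ["node-10"])

keys = [f"key:{i}" for i in range(50_000)]
moved = sum(1 for k in keys if owner(before, k) != owner(after, k))
print(f"{moved / len(keys):.1%} of keys moved")
```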
The problem with naive consistent hashing: hot spots
With 4 nodes placed by hashing their IDs, the ring is divided into 4 arc segments, each covering a different fraction of the keyspace. Statistically, those arcs are very unlikely to be equal: one node might own 40% of the ring, another only 10%.
Unequal arc ownership means unequal load. The node owning 40% of keys handles 4x the traffic of the one owning 10%. This is a real problem in production: a small number of nodes become hot while others are underutilized.
A second problem: when a node fails, all its load shifts to its single clockwise neighbor. That neighbor, which was already handling its share, now handles double. It may become overwhelmed and fail too, cascading the failure.
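The unevenness is easy to see by computing each node's arc directly. A sketch (node names are arbitrary; the exact percentages depend on the hash values, but the shares are reliably unequal):

```python
# Each node's share of a 2^32 ring when every node gets a single position.
import hashlib

RING = 2**32

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

nodes = ["node-a", "node-b", "node-c", "node-d"]
positions = sorted((h32(n), n) for n in nodes)
for i, (pos, name) in enumerate(positions):
    # Keys between the predecessor and this node belong to this node
    arc = (pos - positions[i - 1][0]) % RING
    print(f"{name}: {arc / RING:.1%} of the ring")
```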
Virtual nodes solve uneven distribution
Instead of placing each physical node once on the ring, place it many times at many positions. Each placement is a "virtual node" (vnode). A typical configuration uses 100-300 vnodes per physical node.
```python
# Virtual nodes: place each physical node at many ring positions
import bisect
import hashlib

VNODES_PER_NODE = 150
ring = []  # sorted (position, node_id) pairs, as in the lookup code

def add_node(node_id: str) -> None:
    for i in range(VNODES_PER_NODE):
        vnode_key = f"{node_id}:vnode:{i}"
        pos = int(hashlib.md5(vnode_key.encode()).hexdigest(), 16) % (2**32)
        bisect.insort(ring, (pos, node_id))  # the physical node owns this position
```
The ring with virtual nodes
With 150 vnodes per node and 4 nodes, the ring has 600 entries. Each physical node owns ~150 ring positions spread around the full circumference. The law of large numbers ensures each node covers roughly 25% of the keyspace regardless of the hash values of node IDs.
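A sketch that demonstrates this: the same four nodes, now placed 150 times each. Summing the arcs owned by each physical node shows every share landing close to 25%. The vnode key format and node names are illustrative choices.

```python
import hashlib
from collections import defaultdict

RING = 2**32
VNODES = 150

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

nodes = ["node-a", "node-b", "node-c", "node-d"]
positions = sorted(
    (h32(f"{n}:vnode:{i}"), n) for n in nodes for i in range(VNODES)
)
share = defaultdict(int)
for i, (pos, name) in enumerate(positions):
    share[name] += (pos - positions[i - 1][0]) % RING  # arc owned by this vnode
for name in nodes:
    print(f"{name}: {share[name] / RING:.1%} of the keyspace")
```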
Benefits of vnodes:
- Balanced load. With enough vnodes, each node handles nearly equal key share.
- Graceful failure. When a node is removed, its 150 vnodes are distributed across all other working nodes. Load spreads evenly instead of doubling on one neighbor.
- Heterogeneous hardware. Give a more powerful node more vnodes. It proportionally handles more keys.
Without vnodes, one node may own 40% of the ring
With 4 nodes placed by a single hash of their IDs, the ring arcs are statistically unequal. In one production case I saw, a node owned 38% of the keyspace and handled roughly 4x the load of another node owning 9%. Using 150 vnodes per node brings each node's share to within a few percent of 25%. This is not optional at production scale.
How Cassandra uses consistent hashing
Cassandra uses a consistent hash ring called the "token ring." Each node owns a set of token ranges. Data is written to the node whose token range includes the partition key's hash.
Replication factor (RF) determines how many nodes store each key. With RF=3, a write goes to the first 3 clockwise nodes from the key's hash position. Reads can be satisfied by any replica, with the consistency level controlling how many must agree.
```
Key:           user_id=42
Hash:          1,847,234,902
Ring position: falls in Node B's token range
RF=3:          write to Node B, Node C, Node D (the next two clockwise)
```
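The replica-selection rule can be sketched as follows: walk clockwise from the key's hash and collect the first RF distinct physical nodes, skipping further vnodes of a node already chosen. This illustrates the idea, not Cassandra's actual implementation; node names and vnode counts are assumptions.

```python
import bisect
import hashlib

RING = 2**32

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = sorted((h32(f"{n}:vnode:{i}"), n) for n in nodes for i in range(8))

def replicas(key: str, rf: int = 3):
    start = bisect.bisect_left(ring, (h32(key), ""))
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:  # at most one copy per physical node
            owners.append(node)
        if len(owners) == rf:
            break
    return owners

print(replicas("user_id=42"))  # three distinct physical nodes
```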
When a node is added, Cassandra needs to "stream" (transfer) the keys that now hash to the new node. With vnodes, this streaming is parallel: the new node receives slices from many different existing nodes simultaneously instead of a single large transfer from one.
Where consistent hashing is used
| System | Use |
|---|---|
| Cassandra | Partition key → token ring → replica set |
| Amazon DynamoDB | Internal partition routing |
| Memcached (with client-side library) | Client-side key-to-server routing |
| Nginx (consistent hashing upstream) | `hash $request_uri consistent;` for sticky routing |
| Riak | Dynamo-style ring with virtual nodes for partition ownership |
| CDN edge node selection | Some CDN implementations route cache miss origin pulls via consistent hash to reduce origin fan-out |
When consistent hashing is the wrong tool
Consistent hashing does not solve:
- Skewed access patterns (hot keys). If 90% of requests are for the same 1% of keys, consistent hashing does nothing. Those keys still land on the same node. Use read replicas, local caching, or key spreading (appending a random suffix and scatter-gathering).
- Range queries. Consistent hashing distributes keys by hash, destroying natural key ordering. Range scans require querying all nodes and merging results.
- Small key spaces. With fewer than a few thousand distinct keys, consistent hashing has no advantage over a static partition table.
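The key-spreading mitigation mentioned above can be sketched like this: writers populate several suffixed copies of the hot key so they hash to different nodes, and each reader picks one copy at random. The fan-out count and the `#i` suffix format are illustrative choices; a plain dict stands in for the distributed cache.

```python
import random

FANOUT = 8

def spread_write(cache: dict, hot_key: str, value: str) -> None:
    # Write every copy so any suffix a reader picks is populated
    for i in range(FANOUT):
        cache[f"{hot_key}#{i}"] = value  # each suffix hashes to its own ring position

def spread_read(cache: dict, hot_key: str) -> str:
    # Each reader picks one random copy, spreading load across nodes
    return cache[f"{hot_key}#{random.randrange(FANOUT)}"]

cache: dict = {}
spread_write(cache, "popular:item", "payload")
print(spread_read(cache, "popular:item"))  # prints "payload"
```

The cost is FANOUT times the write traffic and storage for that key, which is usually acceptable precisely because the key is hot.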
Consistent hashing vs range sharding: choose the right strategy
Consistent hashing does not fix hot keys
A hot key that gets 50,000 requests/second will always land on the same node regardless of how many nodes you add or how many vnodes you configure. Adding nodes to a cluster with a hot key makes every other node less loaded but leaves the hot node exactly as loaded. Solve hot keys at the application tier, not the sharding layer.
Interview cheat sheet
- Consistent hashing moves only `1/N` of keys when a node is added or removed. Modulo hashing moves nearly all keys.
- Virtual nodes give each physical node multiple ring positions, ensuring near-equal load distribution and graceful failure handling.
- Cassandra and DynamoDB are the canonical production examples.
- Mention virtual nodes immediately after consistent hashing. Interviewers who ask about consistent hashing expect to hear about the hot-spot problem and the vnode solution.
- Hot keys cannot be solved by consistent hashing alone. Name the right solution (request coalescing, key spreading, local caching) when asked.
- When describing a scale-out event, state the fraction that moves: adding one node to a 10-node cluster moves only about 1/11 (~9%) of keys, not all keys. Explicitly contrast with modulo hashing, where nearly all keys relocate.
- Range-based sharding preserves sort order; consistent hashing destroys it. Hash-sharded clusters cannot serve range queries without scatter-gather across all nodes. Know which your system needs before choosing.
- In Cassandra, replication factor determines how many clockwise successors receive each write. RF=3 means data loss requires all three replicas to fail simultaneously, which is why RF=3 with QUORUM is the standard production setting.
Quick recap
- Modulo hashing (`key % N`) invalidates nearly all key assignments when the cluster size changes. Consistent hashing invalidates only `1/N`.
- Keys and nodes both map onto the same hash ring. A key belongs to the first node clockwise from its ring position.
- Without virtual nodes, ring positions are uneven and one node may own much more key space than others.
- Virtual nodes (typically 100-300 per physical node) ensure near-equal load distribution and spread failure load across many nodes.
- Consistent hashing solves redistribution overhead. It does not solve hot key concentration. Hot keys require application-level solutions.
- Cassandra and DynamoDB use token rings with virtual nodes as their core partitioning mechanism.
Related concepts
- Sharding — Consistent hashing is one of the two strategies for sharding (the other being range-based). Understanding the tradeoff is essential for data modeling decisions.
- Load Balancing — Some load balancers use consistent hashing to provide sticky routing (always send the same client to the same server), useful for session affinity without shared session state.
- Caching — Memcached client libraries commonly implement consistent hashing for cache key routing, which is why adding a cache node does not cause a full cache invalidation storm.