Gossip protocol
Learn how gossip protocols propagate cluster state to every node in O(log N) rounds without a central coordinator, and how Cassandra and Consul use it for failure detection and membership.
The problem
A 100-node Cassandra cluster needs every node to know which other nodes are alive or dead. The naive approach: a central coordinator collects heartbeats from all 99 other nodes every second. The coordinator checks for missed heartbeats and marks nodes as down.
At 100 nodes: 99 heartbeats/second arriving at the coordinator. Manageable. At 1,000 nodes: 999 heartbeats/second. Still fine. At 10,000 nodes: 9,999 heartbeats/second, plus the coordinator must track state for 10,000 nodes, maintain 10,000 TCP connections, and respond to every other node asking "is node X alive?" Under coordinator failure, you have lost the entire membership view. The coordinator is a single point of failure.
The scaling math rules out the coordinator model: O(N) load on one machine grows linearly with cluster size, and that one machine eventually becomes the bottleneck or the source of catastrophic failure. You cannot distribute the coordinator's load without rebuilding the system from scratch.
Gossip protocols solve this without any coordinator at all.
What gossip protocol is
A gossip protocol (also called an epidemic protocol) is a distributed information dissemination mechanism where each node periodically selects a small number of random peers, exchanges state with them, and each recipient further propagates the information to other random peers. Information spreads through the cluster like a rumor through a crowd: each person tells a few others, and those others tell a few more.
Think of a disease spreading through a population. One infected person infects an average of 3 others in the first round. Those 3 infect 3 more each (9 total now aware). Those 9 infect 27 more. After just a few doubling rounds, the entire population knows. The spread is logarithmic in time relative to population size, and no central broadcaster is needed.
How gossip works
Each node maintains a membership table: a list of all nodes it knows about, along with a state for each (alive, suspected, or dead) and a monotonically increasing heartbeat counter. Every second (configurable), each node picks 2-3 random peers and pushes its entire membership table to them. Recipients merge the received table with their own, keeping the highest heartbeat counter for each node.
The D2 diagram below traces the same exponential spread at the individual node level, making the fanout structure explicit:
spawnSync d2 ENOENT
With N nodes and fanout F (number of peers each node gossips to per round), the state reaches all nodes in roughly ceil(log(N) / log(F)) rounds. For a 100-node cluster with fanout 3, that is ceil(log(100) / log(3)) = ceil(4.19) = 5 rounds. Each round takes about 1 second, so every node knows within 5 seconds.
For 1,000 nodes with fanout 3: ceil(log(1000) / log(3)) = ceil(6.29) = 7 rounds (7 seconds). For 10,000 nodes: 9 rounds. This is the power of the logarithmic convergence: cluster size grows 100x but propagation time grows by only a few seconds.
// Pseudocode: one gossip round from node A's perspective
function gossip_round():
peers = random_sample(all_known_nodes, count=FANOUT) // pick 3 random peers
for peer in peers:
send(peer, my_membership_table)
function on_receive_gossip(table_from_peer):
for each entry in table_from_peer:
local = my_membership_table[entry.node_id]
if entry.heartbeat > local.heartbeat:
// peer has fresher info; update our record
my_membership_table[entry.node_id] = entry
// Separately, each node increments its own heartbeat every second:
function heartbeat_tick():
my_membership_table[MY_NODE_ID].heartbeat += 1
The merge rule is simple: always keep the highest heartbeat counter seen. If a node stops incrementing (because it crashed), its heartbeat counter stays flat while the others' counters increase. Eventually the counter is so stale that other nodes decide the node is suspected.
PHI accrual failure detection
Simple heartbeat-based failure detection uses a fixed timeout: "if I have not heard from node X in 10 seconds, it is dead." This is fragile. A brief network hiccup causing a 10.5-second gap falsely marks a healthy node as dead. A high-latency link might make a healthy node appear intermittently dead.
Cassandra uses PHI accrual failure detection, which is adaptive rather than threshold-based. Instead of a fixed deadline, it models the heartbeat arrival distribution statistically and outputs a continuous suspicion level called PHI.
PHI is calculated as:
PHI = -log10(P_later)
where P_later is the probability that the next heartbeat arrives
later than the current time, given the historical arrival distribution.
In plain terms: if node X has been sending heartbeats regularly every 1,000 ms plus or minus 50 ms (normal variance), and it is now 2,000 ms since the last heartbeat, that is statistically extremely unlikely under normal conditions. PHI climbs to indicate high suspicion. At PHI = 8, Cassandra marks the node as down.
| PHI value | Interpretation |
|---|---|
| 0-1 | Node almost certainly alive |
| 3-5 | Something odd; watch it |
| 8 | Cassandra marks it DOWN |
| 10+ | Cassandra begins handoff procedures |
The adaptive behavior: if a node's heartbeat variance naturally increases (e.g., due to GC pauses or network jitter), the model adapts. The threshold is not "8 seconds" but "statistically improbable given this node's behavior." A node with high natural variance is not falsely downed during a normal GC pause.
Anti-entropy repair vs rumor mongering
Gossip is used for two distinct purposes in distributed systems, and the protocols differ:
Rumor mongering (event-driven): A node receives new information (e.g., node 7 just died). It starts "gossiping" this specific event to random peers. Each peer that has not seen this event re-gossips it. The event propagates in O(log N) rounds. This is fast but lossy: if a node is partitioned during the gossip storm, it may miss the event and never receive it.
Anti-entropy (state synchronization): Nodes periodically exchange their entire state or a compact summary (e.g., a hash of their data). Differences are identified and reconciled. This is slower (O(N) data to exchange in the worst case) but eventually correct: even a node that was partitioned for hours will catch up when it reconnects. Cassandra uses Merkle trees for anti-entropy repair to compare which data is out of sync without exchanging the entire dataset.
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.