Split-brain anti-pattern
Learn why distributed systems with two primaries accepting conflicting writes are dangerous, how fencing tokens and STONITH prevent data loss, and why quorum alone isn't enough.
TL;DR
- Split-brain occurs when a network partition causes two nodes to each believe they are the only active primary, and both accept writes, creating divergent state that cannot be automatically reconciled.
- Even with majority quorum, a poorly timed partition can produce two primaries: one finishing its "last term" writes and one starting a new term.
- The consequences (duplicate orders, double payments, lost messages) are often silent and only discovered during reconciliation or auditing, sometimes days later.
- Prevention requires fencing tokens (monotonic IDs that reject stale writes), STONITH (forcibly killing the old primary), or a consensus protocol like Raft that structurally prevents two leaders in the same term.
The Problem
Your primary database fails a health check at 2:47 a.m. Your high-availability setup promotes the replica to primary. Normal so far.
But the original primary didn't actually fail. It experienced a 30-second network partition. Its health check packets were lost. It's still running, still accepting writes from one subset of application servers that can reach it. The replica is now also accepting writes from another subset. Both believe they are the authoritative primary.
I've seen this exact scenario take down an order pipeline for a retail company during a flash sale. For 90 seconds, the Orders table in "Primary A" received 400 order writes while "Primary B" received 600. When the network healed and both nodes reconnected, there were 1,000 divergent order records with overlapping auto-increment IDs.
You now have a data reconciliation problem with no automatic solution. Some orders exist in only one database. Some IDs were created twice for different orders. Your payment processor ran against one view; your fulfillment system ran against the other.
The bottom line: silent data divergence is worse than a visible outage. An outage triggers alerts. Split-brain corrupts data quietly.
What reconciliation actually looks like
When the partition heals, most HA systems detect the dual-primary situation and demote one node. But the damage is already done. You're left with two divergent datasets.
Reconciliation requires answering questions like:
- Order #1001 exists on both primaries with different customer IDs. Which is real?
- Payment for order #1001 was charged on Primary A's version. The customer on Primary B's version never paid. Do we charge them retroactively or cancel?
- Inventory was decremented on both primaries. The actual stock count is neither value.
- Analytics pipelines consumed data from both primaries. Reports generated during the window are wrong.
There's no generic tool that solves this. Each table needs domain-specific merge logic. Most teams end up writing one-off scripts, which themselves can introduce bugs if the merge logic has edge cases.
For the retail company I mentioned, the reconciliation strategy was:
- Orders table: Use the payment processor's transaction records as the source of truth. If a payment went through, the order is real regardless of which primary created it.
- Inventory table: Recount physical stock and adjust. Both primaries' counts were wrong.
- User sessions: Discard both and force re-login. Acceptable data loss for sessions.
- Analytics events: Deduplicate by event ID, keep earliest timestamp. Accept that some metrics for the split-brain window are approximate.
The reconciliation process itself needs to be idempotent and auditable. Document every merge decision so you can explain to the business team (or a regulator) why certain records were prioritized over others. In regulated industries, this documentation isn't optional; it's a compliance requirement.
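The orders merge rule above can be sketched in code. This is an illustrative simplification, not a real tool: the `Order`/`Payment` shapes and the matching rule are assumptions, and a real reconciliation would also log every decision for the audit trail.

```typescript
// Hypothetical shapes for the sketch; real schemas will differ.
interface Order { id: number; customerId: string; }
interface Payment { orderId: number; customerId: string; }

// Merge divergent order tables using the payment processor's records
// as the source of truth: an order is real only if it was paid, and
// the paid customer decides between two versions sharing an ID.
function reconcileOrders(
  primaryA: Order[],
  primaryB: Order[],
  payments: Payment[],
): Order[] {
  const paidBy = new Map<number, string>();
  for (const p of payments) paidBy.set(p.orderId, p.customerId);

  const merged = new Map<number, Order>();
  for (const order of [...primaryA, ...primaryB]) {
    if (paidBy.get(order.id) === order.customerId && !merged.has(order.id)) {
      merged.set(order.id, order);
    }
    // A version with the same ID but no matching payment is dropped here;
    // it must be re-created under a fresh ID out of band.
  }
  return [...merged.values()];
}
```

Because the function is pure and deterministic, re-running it over the same inputs is idempotent, which is exactly the property the reconciliation process needs.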
Why It Happens
A common misconception: "We use majority quorum, so we can't have split-brain." Quorum ensures that reads always see the most recent write, and that a new leader can only be elected with majority support. But during the window between the old leader losing connectivity and the new leader being elected, the old leader may still be accepting writes from clients that can reach it.
The old leader doesn't know it has been demoted. It feels fully healthy. Unless you have a mechanism to explicitly tell it "stop accepting writes, you are no longer the leader" (or kill it outright), it will continue as a rogue primary.
This is the critical insight that most architecture discussions miss: the problem isn't the election; it's the demotion. Most HA systems are good at promoting a new leader. Very few are good at guaranteeing the old leader has stopped writing. The gap between these two operations is where split-brain lives.
Split-brain emerges from individually reasonable decisions:
- You enable automatic failover because manual promotion takes too long during outages.
- You set aggressive health check timeouts (5-10 seconds) because slow failover means downtime.
- You don't implement fencing because it adds complexity and most failovers work fine without it.
- You don't test network partitions because chaos engineering feels risky in production.
Each decision is defensible. The combination creates a system where a transient network blip triggers promotion while the old primary is still alive and writing.
Common trigger scenarios
Split-brain doesn't require a dramatic infrastructure failure. These mundane events trigger it regularly:
- Switch firmware upgrade: A top-of-rack switch reboots during a firmware update. The primary and HA controller are on different sides of the switch. The HA controller can't reach the primary for 15-30 seconds and promotes the replica.
- GC pause on the primary: A long garbage collection pause (Java, .NET) causes the primary to miss health check deadlines. The HA controller thinks it's dead. When the GC finishes, the primary resumes writing.
- DNS resolution failure: The HA controller resolves the primary's hostname to a stale IP after a cloud networking event. It can't reach the primary and promotes the replica, even though the primary is running fine.
- Asymmetric partition: The primary can reach the database clients but not the HA controller. From the clients' perspective, everything is fine. From the HA controller's perspective, the primary is dead.
The asymmetric partition is particularly dangerous because the primary appears healthy to its clients. Write traffic continues normally. The only sign of trouble is the HA controller promoting a replica. If your clients don't re-resolve the primary address after failover (common with connection pooling), they'll keep writing to the old primary indefinitely.
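One mitigation for the stale-connection-pool problem is to validate a connection's role before reusing it for writes. A minimal sketch, assuming a driver-agnostic `QueryFn` (the name and result shape are illustrative, not a real pool API):

```typescript
// Stand-in for your driver's query function; shape is an assumption.
type QueryFn = (sql: string) => Promise<{ inRecovery: boolean }>;

// Run as a pool checkout hook: reject connections to a node that is
// no longer a writable primary, forcing re-resolution after failover.
async function assertWritablePrimary(query: QueryFn): Promise<void> {
  // On PostgreSQL, pg_is_in_recovery() returns true on a replica
  // or a demoted node.
  const { inRecovery } = await query("SELECT pg_is_in_recovery()");
  if (inRecovery) {
    // Evict this connection so the pool re-resolves the primary address.
    throw new Error("node is no longer a writable primary; re-resolve");
  }
}
```

This check adds one round trip per checkout, so some teams run it periodically per connection instead of on every checkout.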
The timeline that makes this so dangerous: the partition starts and the old primary keeps accepting writes; the HA controller misses its health checks and promotes the replica; for the remainder of the partition both nodes accept writes; when the network heals, the cluster demotes one node, but the divergent writes already exist.

Be careful not to confuse the problem with the solution. Adding a tie-breaker node (arbiter) or a witness prevents election ambiguity, but it does not prevent the old primary from accepting writes during the partition. Fencing the old primary is always required.
How to Detect It
Split-brain is insidious because both primaries look healthy when examined individually. You only see the divergence when you compare them. I've seen teams spend hours debugging "random data inconsistencies" before realizing they had a 45-second split-brain event three days earlier.
| Symptom | What It Means | How to Check |
|---|---|---|
| Two nodes reporting `role=primary` | Active split-brain in progress | Query `SELECT pg_is_in_recovery()` on all nodes; two returning `false` = split-brain |
| Divergent row counts on same table | Writes went to different primaries | `SELECT COUNT(*) FROM orders` on both nodes after partition heals |
| Overlapping auto-increment IDs with different data | Both primaries assigned same IDs | `SELECT id, data_hash FROM orders` and compare across nodes |
| Replication lag suddenly drops to zero then spikes | Replica was promoted and now has its own write stream | Monitor `pg_stat_replication` or equivalent |
| Application logs show successful writes during a "failover window" | Old primary was still accepting writes | Correlate write timestamps across both nodes |
| Fencing token rejections in storage logs | Old primary attempted writes after demotion | Search storage logs for "stale fencing token" or "rejected" entries |
| Customers reporting conflicting data | Different app servers read from different primaries | Compare query results when routing to each node explicitly |
In my experience, the fastest detection method is a simple heartbeat: each primary writes a unique token to a shared external store (etcd, ZooKeeper) every second. If two different tokens appear for the same cluster, you have split-brain. Most teams discover split-brain retroactively during reconciliation, which is too late.
Proactive monitoring setup
The key insight: don't wait for data divergence to detect split-brain. Monitor for the preconditions.
```typescript
// Pseudo-code: split-brain detector running on each node
async function checkForSplitBrain(): Promise<void> {
  const myRole = await db.query("SELECT pg_is_in_recovery()");
  const clusterPrimaries = await etcd.get("/cluster/primaries");

  if (!myRole.isRecovery && clusterPrimaries.length > 1) {
    await alerting.fire("SPLIT_BRAIN_DETECTED", {
      severity: "critical",
      primaries: clusterPrimaries,
      message: "Multiple nodes reporting as primary",
    });
    // Optionally: self-fence if our token is lower
    if (myFencingToken < Math.max(...clusterPrimaries.map(p => p.token))) {
      await db.setReadOnly(true);
    }
  }
}
```
Set up alerts on:
- Two nodes with `primary` role in the same cluster (immediate page)
- Replication lag dropping to zero on a known replica (it may have been promoted)
- Fencing token conflicts at the storage layer (stale writes being rejected)
- Network partition duration exceeding failover timeout (split-brain is imminent)
The difference between a team that handles split-brain in minutes versus one that discovers it days later comes down to monitoring. Invest 2-3 hours setting up these alerts. You'll be glad you did.
What to do if split-brain is detected in progress
If your monitoring detects an active split-brain (two nodes reporting as primary), respond immediately:
1. Identify the legitimate primary. Check which node has the higher fencing token or was most recently elected by the HA controller.
2. Force the rogue primary to read-only. On PostgreSQL: `ALTER SYSTEM SET default_transaction_read_only = on; SELECT pg_reload_conf();`
3. Redirect all clients to the legitimate primary. Update DNS, service discovery, or load balancer configuration.
4. Assess divergence. Compare row counts and recent write timestamps to quantify how many records diverged.
5. Begin reconciliation. Use the strategies from "The Problem" section to merge divergent data.
The first three steps should happen within minutes. Steps 4 and 5 are the long tail and can take days.
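Step 4 can be automated as a quick comparison of per-node summaries gathered right after fencing. A minimal sketch, where the `NodeSummary` shape is an assumption (in practice you'd populate it from `COUNT(*)` and `MAX(updated_at)` queries per table):

```typescript
// Hypothetical per-node summary; timestamps in epoch milliseconds.
interface NodeSummary { rowCount: number; maxWrittenAt: number; }

// Rough first-pass divergence assessment for one table.
function assessDivergence(
  a: NodeSummary,
  b: NodeSummary,
  partitionStart: number,
) {
  return {
    // Difference in row counts: a quick signal, not the full story —
    // the true number of divergent rows can be much larger.
    rowCountDelta: Math.abs(a.rowCount - b.rowCount),
    // True if both nodes accepted writes after the partition began,
    // i.e. a genuine split-brain rather than a clean failover.
    bothWroteDuringPartition:
      a.maxWrittenAt > partitionStart && b.maxWrittenAt > partitionStart,
  };
}
```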
The Fix
There are three main approaches to preventing split-brain, each suited to different system architectures. The right choice depends on your infrastructure, your availability requirements, and how much complexity you're willing to add.
Fix 1: Fencing tokens
Every lease or lock grant includes a monotonic token. Storage nodes reject any write carrying a token older than the most recently seen token. When a new primary is elected with token 42, the old primary's writes with token 41 are rejected at storage.
```typescript
// Storage node validates fencing token on every write.
// (Sketched as a class so the highest-seen token is per-node state.)
class StorageNode {
  private highestSeenToken = 0;

  constructor(private storage: KeyValueStore) {}

  async handleWrite(request: WriteRequest): Promise<WriteResponse> {
    if (request.fencingToken < this.highestSeenToken) {
      return { status: "REJECTED", reason: "stale fencing token" };
    }
    this.highestSeenToken = request.fencingToken;
    await this.storage.write(request.key, request.value);
    return { status: "OK" };
  }
}
```
Martin Kleppmann's "Designing Data-Intensive Applications" popularized this pattern as the correct way to handle leader transitions.
How it works with leases: The primary holds a lease (a time-limited lock) from a coordination service like ZooKeeper or etcd. The lease includes a monotonically increasing token. When the lease expires or the primary is fenced, the new primary gets a higher token. Even if the old primary doesn't know its lease expired, its writes are rejected at the storage layer because its token is stale.
The lease TTL is a critical tuning parameter. Too short (1-2 seconds) and normal GC pauses or network jitter cause spurious failovers. Too long (30+ seconds) and you have a wide split-brain window before the lease expires. Most production systems use 5-10 second leases as a balance.
```typescript
// Lease renewal with fencing token
async function renewLease(): Promise<LeaseGrant> {
  const lease = await etcd.grant({ ttl: 10 }); // 10-second lease
  const token = lease.id; // Monotonically increasing
  await etcd.put("/cluster/primary", nodeId, { lease: lease.id });
  return { leaseId: lease.id, fencingToken: token };
}

// Primary must renew before TTL expires
setInterval(async () => {
  try {
    await etcd.leaseKeepAlive(currentLease.leaseId);
  } catch (error) {
    // Lease renewal failed: self-demote to read-only
    await db.setReadOnly(true);
    logger.error("Lease renewal failed, demoting to read-only");
  }
}, 3000); // Renew every 3s for a 10s lease
```
Trade-off: Every storage node must track and validate tokens. Adds a small amount of write-path latency (microseconds) and requires all clients to include the token. If even one write path skips token validation, the fencing is bypassed.
Fix 2: STONITH (Shoot The Other Node In The Head)
When a new primary is elected, send a hard shutdown command to the old primary via an out-of-band channel (IPMI, iLO, PDU power cut). If the old primary can't be confirmed dead, the election is blocked. Better a momentary outage than a split-brain.
AWS RDS Multi-AZ uses a variant of this: the old primary is fenced at the network level (security group update) before the replica promotion completes.
In on-premises environments, STONITH is typically implemented via:
- IPMI/iLO/DRAC: Send a hardware-level power-off command to the old primary's BMC (baseboard management controller). Works even if the OS is hung.
- PDU power cut: Cut power to the old primary's rack position via a managed power distribution unit. The most reliable method but requires physical infrastructure access.
- SBD (STONITH Block Device): A shared disk that acts as a "poison pill." The old primary periodically reads the SBD; if its "slot" is marked as fenced, it self-terminates. Used in Pacemaker/Corosync clusters.
Trade-off: Requires out-of-band management access. If the management network is also partitioned, you're stuck. Some teams use a "poison pill" approach where the old primary self-terminates when its lease expires.
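The SBD poison-pill loop described above can be sketched in a few lines. This is an illustration of the control flow, not real SBD plumbing: `readSlot` and `terminate` are stand-ins for the shared-disk read and the hard stop (a real implementation panics or power-cycles rather than returning).

```typescript
// Slot state on the shared device: "fenced" is the poison pill.
type Slot = "clear" | "fenced";

// The primary polls its slot; if it has been marked fenced by the new
// leader, it terminates itself. maxPolls is bounded here only so the
// loop can be driven in a test.
async function sbdWatchdog(
  readSlot: () => Promise<Slot>,
  terminate: () => void,
  pollMs = 1000,
  maxPolls = Infinity,
): Promise<void> {
  for (let i = 0; i < maxPolls; i++) {
    if ((await readSlot()) === "fenced") {
      terminate(); // in production: immediate hard stop, never a clean shutdown
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```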
Fix 3: Raft-based consensus
Raft structurally prevents two leaders in the same term. A leader can only append to the log if it can confirm a majority of nodes acknowledge each entry. If it loses majority connectivity, it steps down. A new leader can only be elected by a majority.
Split-brain requires both a new leader to be elected AND the old leader to continue with majority connectivity, which contradicts the definition of a majority. Systems like etcd, CockroachDB, and TiKV use Raft for exactly this reason.
Trade-off: You must accept leader elections (brief unavailability, typically under 1 second in etcd, 150-600ms in default Raft configurations). You also need an odd number of nodes (3, 5, 7) to avoid tied elections.
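The majority arithmetic is worth making concrete: during a partition, at most one side of the split can contain a majority, so at most one leader can keep committing. A tiny sketch (far below a real Raft implementation, which also handles terms, log matching, and elections):

```typescript
// A leader keeps leadership for a heartbeat round only if it hears
// from a majority of the cluster (its own vote counts as one ack).
function leaderRetainsLeadership(clusterSize: number, acks: number): boolean {
  const majority = Math.floor(clusterSize / 2) + 1;
  return acks >= majority;
}
```

In a 5-node cluster split 3/2, the leader stranded on the 2-node side can gather at most 2 acks against a required majority of 3, so it steps down while the 3-node side can elect a new leader.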
Comparison of prevention mechanisms
| Mechanism | How it prevents split-brain | Failure mode | Latency impact | Complexity |
|---|---|---|---|---|
| Fencing tokens | Storage rejects stale writes | Bypassed if any write path skips validation | Microseconds per write | Low (add token to write path) |
| STONITH | Old primary is killed | Fails if management network is also partitioned | Zero (happens during failover) | Medium (requires out-of-band access) |
| Raft consensus | Structurally impossible (majority required) | Brief unavailability during elections | Sub-millisecond (local append) | High (requires Raft implementation) |
| Lease-based fencing | Old primary's lease expires, writes rejected | Old primary may write before lease check propagates | Depends on lease TTL | Medium |
| CRDTs | Divergence is acceptable, auto-merged | Merge semantics may not match business rules | Zero (no coordination) | Low-medium (limited data types) |
Real-world implementations
PostgreSQL + Patroni: Uses etcd (or another DCS) for leader election and implements fencing via a leader key with a TTL. The old primary checks the leader key before accepting writes; if the key doesn't belong to it, it demotes itself to read-only. This is a lease-based fencing approach.
MySQL Group Replication: Uses a Paxos-based group communication layer. A node that can't reach a majority of the group automatically switches to read-only mode. This is structurally similar to Raft.
Redis Sentinel: An imperfect solution. Sentinel elects a new primary, but clients that cached the old primary's address can still write to it until they refresh. Redis doesn't implement fencing tokens natively. This is why Redis Cluster (which uses a gossip protocol with majority agreement) is preferred for use cases where split-brain matters.
Amazon Aurora: Uses a shared storage layer with a single writer. The storage layer itself enforces that only one node can write at a time, which is a form of storage-level fencing. Failover promotes a replica to writer by updating the storage layer's writer ID.
MongoDB Replica Sets: Uses Raft-like consensus for elections. A node must receive votes from a majority to become primary. If the old primary is partitioned, it steps down when it can't reach a majority. However, clients that cached the old primary's connection may still send writes to it; the MongoDB driver's retryWrites feature handles this by retrying failed writes on the new primary.
For MongoDB specifically, writes to a partitioned old primary fail with a NotWritablePrimary error once the node steps down. But there's a brief window between the partition starting and the node detecting the loss of majority where writes succeed on the old primary. This is the same fundamental issue as other systems: the demotion is not instantaneous.
Google Spanner: Uses TrueTime (GPS-synchronized clocks) combined with Paxos for consensus. The clock synchronization bounds ensure that even during leader transitions, the system can order writes correctly without fencing tokens. This is a hardware-assisted solution that most teams can't replicate.
The common thread
All of these implementations solve the same problem: ensuring that the old leader stops writing before (or simultaneously with) the new leader starting. The mechanism differs (lease expiry, majority loss, storage-layer enforcement, clock bounds), but the principle is the same. Pick the mechanism that fits your infrastructure. If you're on AWS, Aurora's storage-level fencing is built in. If you're running your own PostgreSQL, Patroni with etcd gives you lease-based fencing. If you're building a new distributed system from scratch, embed Raft directly.
Which fix to use?
- Managed cloud database: rely on the provider's built-in storage-level fencing (e.g., Aurora).
- Self-hosted primary-replica database: lease-based fencing via a coordination service (e.g., Patroni with etcd).
- On-prem cluster with out-of-band management: STONITH, usually alongside lease-based fencing.
- New distributed system built from scratch: embed Raft so two leaders in the same term are structurally impossible.
Severity and Blast Radius
Split-brain is one of the most severe distributed systems failures because the damage is invisible at the time it happens. Both primaries accept writes successfully. Clients get 200 OK responses. No alarms fire. The divergence only surfaces later, during reconciliation, auditing, or when a customer reports conflicting data. This is what makes it worse than a clean outage: an outage is loud and obvious; split-brain is silent and corrosive.
The blast radius fans out to every system that reads from the divergent primaries:
- Payments: One primary charged the customer; the other didn't. Or both did, resulting in a double-charge.
- Fulfillment: One primary created the shipping order; the other didn't. Customer gets charged but nothing ships.
- Analytics: Dashboards show different numbers depending on which replica they query. Business decisions made on corrupted data.
- Audit logs: Compliance reports become unreliable because the event timeline is forked.
Recovery difficulty: Hard. You can't just "pick one primary and discard the other" because both contain valid writes that the other is missing. Merging requires domain-specific conflict resolution logic that probably doesn't exist when you need it. For a 90-second split-brain window with 1,000 divergent writes, expect 2-5 days of manual reconciliation work. For longer windows or higher write throughput, multiply accordingly.
I once watched a team spend three weeks reconciling a 4-minute split-brain incident. The most time-consuming part wasn't the data merge; it was figuring out which downstream systems had consumed the divergent data and needed correction.
Split-brain severity depends on what was being written
Split-brain on a session cache is annoying. Split-brain on a payments table is a regulatory incident. Split-brain on a medical records database could be life-threatening. Assess severity based on the data being written, not just the infrastructure involved.
When It's Actually OK
Not every system needs to prevent split-brain. The severity depends entirely on what happens when two nodes diverge. Before adding fencing complexity, ask: "If both nodes accepted writes for 60 seconds, what's the worst business outcome?"
- Read-only replicas with no promotion: If replicas never get promoted to primary, split-brain can't happen. Stale reads are a different (lesser) problem.
- CRDT-based systems: Conflict-free replicated data types are designed for concurrent writes. Counters, sets, and LWW-registers merge automatically. You accept the convergence semantics by design. Riak's data types are a well-known implementation.
- Active-active with last-writer-wins: If your data model tolerates LWW semantics (session caches, view counters, non-critical metadata), two writers are acceptable. DynamoDB global tables use this approach.
- Dev/staging environments: Split-brain testing is actually valuable for validating your fencing implementation. Let it happen on purpose in non-production to verify your detection and prevention work correctly.
- Immutable append-only logs: If both primaries are only appending (never updating existing rows), reconciliation is much simpler: merge both logs and deduplicate. This is why event sourcing systems are more resilient to split-brain than mutable-state systems.
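The append-only reconciliation described in the last bullet (merge both logs, deduplicate) matches the analytics-events strategy from earlier: dedupe by event ID, keep the earliest timestamp. A sketch, with an illustrative `Event` shape:

```typescript
// Hypothetical event shape for the sketch.
interface Event { id: string; ts: number; payload: string; }

// Merge two append-only logs from a split-brain window: deduplicate by
// event ID, keeping the earliest copy, then restore a stable order.
function mergeLogs(a: Event[], b: Event[]): Event[] {
  const byId = new Map<string, Event>();
  for (const e of [...a, ...b]) {
    const seen = byId.get(e.id);
    if (!seen || e.ts < seen.ts) byId.set(e.id, e); // earliest wins
  }
  // Sort by timestamp, breaking ties by ID for determinism.
  return [...byId.values()].sort(
    (x, y) => x.ts - y.ts || x.id.localeCompare(y.id),
  );
}
```

Note that this only works because events are never updated in place; with mutable rows, "merge and dedupe" is not a valid strategy.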
Testing for split-brain resilience
You won't know if your fencing works until you test it. Here's a practical approach:
1. Simulate network partition: Use `iptables` rules or `tc` (traffic control) to block traffic between the primary and the HA controller while keeping the primary reachable from some app servers.
2. Verify promotion: Confirm the replica gets promoted to primary.
3. Attempt writes to old primary: Send writes to the old primary from a client that can still reach it.
4. Verify fencing: Check that writes to the old primary are rejected (fencing tokens) or that the old primary is unreachable (STONITH).
5. Heal the partition: Remove the `iptables` rules and verify the cluster recovers to a single-primary state.
If step 4 fails (old primary accepts writes), your fencing implementation has a gap. Fix it before you need it at 2:47 a.m. on a Friday night.
Netflix's Chaos Monkey and similar chaos engineering tools automate this type of testing. For database-specific split-brain testing, tools like toxiproxy can simulate network partitions between specific hosts without affecting the rest of your infrastructure.
The key rule: test your fencing in staging before you trust it in production. A fencing implementation that hasn't been tested under partition conditions is an untested assumption, not a safety guarantee.
```typescript
// Chaos test: verify fencing token enforcement
async function testFencingTokens(): Promise<void> {
  // Simulate stale primary with old token
  const staleToken = 41;
  const currentToken = 42;

  const result = await storage.write({
    key: "test-key",
    value: "stale-write",
    fencingToken: staleToken,
  });
  assert(result.status === "REJECTED",
    `Fencing broken: stale write accepted with token ${staleToken}`);

  const validResult = await storage.write({
    key: "test-key",
    value: "valid-write",
    fencingToken: currentToken,
  });
  assert(validResult.status === "OK",
    "Valid write should be accepted");
}
```
How This Shows Up in Interviews
When your design has a primary and one or more replicas with leader election, interviewers ask: "What happens during a network partition between primary and replica?" The correct answer names the split-brain risk, explains why the old primary doesn't know it's been demoted, and describes fencing tokens or STONITH as the mechanism to prevent it.
This question is common in system design interviews for any system that uses primary-replica replication: databases, distributed caches, message brokers, even configuration stores. It separates candidates who understand replication from those who can reason about failure modes.
Saying "we use majority quorum" without addressing the old-primary rogue window is a common gap that costs candidates points. The interviewer wants to hear that you understand the difference between electing a new leader (which quorum handles) and stopping the old leader from writing (which requires fencing).
Silent data divergence is worse than visible downtime
A split-brain that goes undetected is a data integrity disaster. An outage is visible and recoverable. Conflicting writes in two databases can corrupt business data in ways that take weeks to find and months to reconcile. Always prefer a brief outage (fencing/STONITH) over the possibility of silent divergence.
A strong answer includes:
- Naming the split-brain risk explicitly when discussing failover
- Explaining why quorum alone doesn't prevent old-primary writes
- Describing at least one fencing mechanism (tokens, STONITH, or Raft)
- Mentioning that silent divergence is worse than downtime
Example interview phrasing
"For leader election, I'd use Raft-based consensus which structurally prevents split-brain. If we're using a traditional primary-replica setup with automatic failover, I'd implement fencing tokens so the storage layer rejects stale writes from a demoted primary. The key risk isn't the election itself, it's the old primary continuing to accept writes during the partition window."
Quick Recap
- Split-brain happens when two nodes both believe they are the primary and accept concurrent divergent writes.
- Network partitions trigger it; quorum prevents re-election ambiguity but doesn't stop the old primary from writing during the partition window.
- Adding an arbiter or witness prevents election ties, but does not prevent the old primary from accepting writes. Fencing is always required.
- Fencing tokens ensure stale writes are rejected at storage regardless of which node sent them. Every write path must validate the token.
- STONITH forcibly kills the old primary. If you can't confirm it's dead, don't promote the replica.
- Raft-based consensus structurally prevents split-brain by making log appends impossible without majority acknowledgement, at the cost of brief unavailability during elections.
- Silent data divergence is far worse than visible downtime, because corruption spreads to downstream systems before anyone notices.
- For active-active designs, use CRDTs or LWW semantics for data that tolerates it, and single-primary for data that doesn't.
- Test your fencing implementation under real partition conditions in staging. An untested fencing mechanism is an assumption, not a guarantee.