Two-phase commit coordinator crash anti-pattern

TL;DR

Two-phase commit (2PC) is a distributed transaction protocol where a coordinator asks all participants to "prepare to commit," then issues a final "commit" or "abort" instruction.
If the coordinator crashes after participants have responded "ready" but before it sends the final decision, participants are stuck: they've locked resources but can't commit or abort without hearing from the coordinator.
This is the blocking problem: participants hold locks for as long as the coordinator is down, which can be minutes to hours during an outage. A crashed coordinator can take down your entire transactional surface area.
Modern distributed systems replace 2PC with sagas (accept eventual consistency) or Raft (embed coordination directly in the participant state machines).

It's 3 a.m. and your payment service coordinator just crashed. Three microservices (inventory, payments, orders) have all voted "yes, I can commit" and are holding database locks. They're waiting for the coordinator to tell them whether to commit or abort. That coordinator isn't coming back for 47 minutes.

I've debugged this exact failure at a fintech startup. The inventory service held row-level locks on 200 SKUs. Every other checkout request that touched those SKUs queued behind the locks. Within 5 minutes, the connection pool was exhausted. Within 10, the entire checkout flow was dead, not just the transactions that were mid-2PC.

2PC has exactly two phases:

Phase 1 (Prepare): Coordinator sends "prepare" to all participants. Each participant writes to its local WAL, acquires locks, and responds "yes, I can commit" or "no, I cannot."

Phase 2 (Commit/Abort): If all participants said yes, coordinator sends "commit" to all. If any said no, coordinator sends "abort" to all.

The fatal window is the gap between Phase 1 completing and Phase 2 beginning:

Participants in the "prepared" state cannot unilaterally commit (they might be the only ones committing, violating atomicity) and cannot unilaterally abort (the coordinator might have already sent "commit" to others before crashing). They must wait.

The cascade effect

The blocking doesn't just affect the 2PC transaction itself. Those held locks block every other transaction that touches the same rows. Here's what the cascade looks like in practice:

Within minutes, the connection pool on the inventory service fills up with blocked queries. New requests can't get a connection. The entire service becomes unresponsive, even for requests that have nothing to do with the 2PC transaction.

The worst part: the monitoring dashboard shows the inventory service as "unhealthy" but gives no indication that the root cause is a 2PC coordinator crash in a completely different service. I've seen on-call engineers spend 20 minutes restarting the inventory service (which provides temporary relief until the connection pool fills again) before someone realizes the coordinator is the actual problem.

Why It Happens

Teams reach for 2PC because the requirement sounds simple: "I need this operation to be atomic across two services." That's a legitimate requirement. The problem is that 2PC solves it with a single-coordinator design that creates a blocking failure mode.

Here's how teams typically arrive at 2PC:

The database grew into multiple services. What was once a single-database transaction spanning two tables became a cross-service call when those tables were split into separate services. The team reaches for 2PC to preserve the atomicity they had before.
The ORM or framework makes it easy. Spring Boot with JTA, for example, makes distributed transactions look like local transactions with @Transactional. Teams don't realize they've introduced a coordinator until it fails.
Saga complexity feels disproportionate. For a simple two-step workflow, writing compensating transactions for both steps plus error handling feels like over-engineering. 2PC looks simpler (until the coordinator crashes).
The failure mode wasn't tested. Most 2PC implementations work fine 99.9% of the time. The blocking problem only manifests during coordinator failures, which teams often don't simulate in testing.

You might think: "Just have participants time out and abort." Unfortunately, this is unsafe. Consider:

Coordinator sent "commit" to A before crashing.
B and C waited, timed out, and aborted.
A committed. B and C aborted.
You now have a partial commit: money transferred on one side, not the other.

The only safe recovery is: coordinator restarts, reads its WAL for the transaction's state, and re-sends the final decision. The entire system is blocked waiting for that restart. If the coordinator's disk is lost, you may need manual intervention to determine which participants to commit and which to abort.

The partial commit scenario is the nightmare case:

This is why "just add a timeout" doesn't work. The timeout-and-abort approach creates a worse problem than the blocking it was trying to solve.

Why this is worse in practice than theory

2PC assumes the coordinator will eventually recover. In practice:

The coordinator's database might be on a failed disk with no recent backup.
Kubernetes might restart the coordinator on a new node without local WAL access.
Multi-datacenter deployments can hold locks across DCs during a DC-level outage.
The "just restart it" recovery takes minutes, during which all participating services are blocked.

A single 2PC coordinator failure can freeze every write operation in your system simultaneously. The blast radius scales with the number of services participating in 2PC transactions.

The math of cascading failure

Let's put concrete numbers on this. Assume:

3 services participate in 2PC transactions
Each service has a connection pool of 50 connections
2PC transactions take 200ms on average
The coordinator crashes with 10 prepared transactions in flight

Each prepared transaction holds 1 connection per participant (3 connections total across services). With 10 prepared transactions, that's 30 connections locked. But each locked row also blocks non-2PC queries that touch the same data.

If 20% of incoming requests touch a locked row, and request rate is 500/second:

100 requests/second queue behind locks
Connection pool fills in ~0.5 seconds per service
Within 5 minutes, all three services are returning 503 Service Unavailable
Any service that depends on these three services also starts failing

This is how a single coordinator crash takes down your entire system.

TL;DR

Two-phase commit (2PC) is a distributed transaction protocol where a coordinator asks all participants to "prepare to commit," then issues a final "commit" or "abort" instruction.
If the coordinator crashes after participants have responded "ready" but before it sends the final decision, participants are stuck: they've locked resources but can't commit or abort without hearing from the coordinator.
This is the blocking problem: participants hold locks for as long as the coordinator is down, which can be minutes to hours during an outage. A crashed coordinator can take down your entire transactional surface area.
Modern distributed systems replace 2PC with sagas (accept eventual consistency) or Raft (embed coordination directly in the participant state machines).

The Problem

2PC has exactly two phases:

Phase 1 (Prepare): Coordinator sends "prepare" to all participants. Each participant writes to its local WAL, acquires locks, and responds "yes, I can commit" or "no, I cannot."

Phase 2 (Commit/Abort): If all participants said yes, coordinator sends "commit" to all. If any said no, coordinator sends "abort" to all.

The fatal window is the gap between Phase 1 completing and Phase 2 beginning:

The cascade effect

The blocking doesn't just affect the 2PC transaction itself. Those held locks block every other transaction that touches the same rows. Here's what the cascade looks like in practice:

Why It Happens

Here's how teams typically arrive at 2PC:

The database grew into multiple services. What was once a single-database transaction spanning two tables became a cross-service call when those tables were split into separate services. The team reaches for 2PC to preserve the atomicity they had before.
The ORM or framework makes it easy. Spring Boot with JTA, for example, makes distributed transactions look like local transactions with @Transactional. Teams don't realize they've introduced a coordinator until it fails.
Saga complexity feels disproportionate. For a simple two-step workflow, writing compensating transactions for both steps plus error handling feels like over-engineering. 2PC looks simpler (until the coordinator crashes).
The failure mode wasn't tested. Most 2PC implementations work fine 99.9% of the time. The blocking problem only manifests during coordinator failures, which teams often don't simulate in testing.

You might think: "Just have participants time out and abort." Unfortunately, this is unsafe. Consider:

Coordinator sent "commit" to A before crashing.
B and C waited, timed out, and aborted.
A committed. B and C aborted.
You now have a partial commit: money transferred on one side, not the other.

The partial commit scenario is the nightmare case:

This is why "just add a timeout" doesn't work. The timeout-and-abort approach creates a worse problem than the blocking it was trying to solve.

Why this is worse in practice than theory

2PC assumes the coordinator will eventually recover. In practice:

The coordinator's database might be on a failed disk with no recent backup.
Kubernetes might restart the coordinator on a new node without local WAL access.
Multi-datacenter deployments can hold locks across DCs during a DC-level outage.
The "just restart it" recovery takes minutes, during which all participating services are blocked.

A single 2PC coordinator failure can freeze every write operation in your system simultaneously. The blast radius scales with the number of services participating in 2PC transactions.

The math of cascading failure

Let's put concrete numbers on this. Assume:

3 services participate in 2PC transactions
Each service has a connection pool of 50 connections
2PC transactions take 200ms on average
The coordinator crashes with 10 prepared transactions in flight

If 20% of incoming requests touch a locked row, and request rate is 500/second:

100 requests/second queue behind locks
Connection pool fills in ~0.5 seconds per service
Within 5 minutes, all three services are returning 503 Service Unavailable
Any service that depends on these three services also starts failing

This is how a single coordinator crash takes down your entire system.

Two-phase commit coordinator crash anti-pattern

TL;DR

The Problem

The cascade effect

Why It Happens

Why this is worse in practice than theory

The math of cascading failure

Continue Reading with Premium

Comments

Two-phase commit coordinator crash anti-pattern

TL;DR

The Problem

The cascade effect

Why It Happens

Why this is worse in practice than theory

The math of cascading failure

Continue Reading with Premium

Comments