Post-mortem: Slack message delays 2022
A post-mortem of a Slack incident where database connection pool exhaustion caused message delivery delays affecting millions of users, with lessons on connection pool sizing.
Incident Summary
- Date: Composite analysis of Slack's documented connection pool incidents (2022 period)
- Duration: 1-4 hours per event, with the longest incident degrading message delivery for approximately 6 hours
- Systems affected: Message send/receive pipeline, workspace loading, channel list rendering, search, and file uploads
- Impact: Messages delayed from milliseconds to minutes. Users saw "sending..." indicators stuck for 30+ seconds. Some messages failed silently. Workspace loading times spiked from under 1 second to 10-15 seconds. Approximately 10 million daily active users experienced degradation.
- Root cause: Database connection pool exhaustion caused by a combination of a traffic spike and slow queries. When connection pools filled, new requests queued waiting for a connection. The queue grew faster than it drained, creating a cascading delay that spread from the database layer through the entire message delivery pipeline.
Connection pool exhaustion is, in my experience, the single most common cause of database-related production incidents, more common than replication lag, disk exhaustion, or primary failure. It is mundane and preventable, which is exactly why it catches teams off guard. The pool is sized for average conditions, tested under average conditions, and then deployed into a world where traffic spikes to 3x average on a Monday morning. The day the traffic hits that spike while a slow query is running is the day you discover your pool was never sized for reality.
I have seen connection pool exhaustion bring down systems at multiple companies. The pattern is always the same: something makes queries slow (a missing index, a lock, replication lag), the pool fills up, requests queue, memory grows, GC pauses kick in, and the system enters a death spiral. The fix is always the same too: aggressive pool wait timeouts, circuit breakers, and automated slow query termination. Every team learns this lesson exactly once, the hard way.
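Of the three fixes, the circuit breaker is the least self-explanatory, so here is a minimal sketch of the idea. The class name, thresholds, and injected clock are illustrative assumptions, not anything taken from Slack's code: after a run of consecutive failures the breaker rejects calls outright, so requests fail fast instead of piling up behind an exhausted pool.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then rejects calls until a cooldown passes."""

    def __init__(self, failure_threshold=5, cooldown_seconds=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per service
        self.cooldown_seconds = cooldown_seconds    # assumed value, tune per service
        self.clock = clock                          # injectable so behaviour is testable
        self.consecutive_failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let traffic through again
        # (a simplified half-open state; the next failure re-opens the circuit).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before touching the pool and report the outcome with `record_success()` or `record_failure()`. A production breaker would usually also limit how many trial requests pass in the half-open state.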
What Happened: The Timeline
| Time | Event |
|---|---|
| 9:00 AM | Monday morning traffic surge begins; traffic reaches 2.5x weekend baseline |
| 9:15 AM | A slow query (unindexed JOIN) begins executing on a hot database shard |
| 9:20 AM | Average query latency on the affected shard rises from 2ms to 200ms |
| 9:25 AM | Connection pool on app servers hitting the affected shard reaches capacity (100/100 connections) |
| 9:28 AM | New requests begin queuing for available connections |
| 9:30 AM | Pool wait time alerts fire; message send p99 latency exceeds 5 seconds |
| 9:35 AM | Request queue depth grows; memory pressure increases from buffered requests |
| 9:40 AM | GC pauses begin on app servers, holding connections longer, further slowing pool drain |
| 9:45 AM | Message send failures begin; HTTP 503 error rate exceeds 5% |
| 10:00 AM | On-call engineer identifies and kills the slow query |
| 10:15 AM | Rolling restart of affected app servers to clear request queues and reset pools |
| ~10:30 AM | Message latency returns to normal levels; incident resolved |
The total impact was roughly 90 minutes of degraded message delivery, with about 30 minutes of that being severe (messages failing or delayed by over 30 seconds). For a platform where "real-time messaging" is the core product, even 30 seconds of delay breaks the user experience.
What users actually experienced. The symptoms were not uniform. Some workspaces (those on the affected shard) saw severe delays. Others were completely unaffected. Within affected workspaces, the experience depended on timing: messages sent right before pool exhaustion went through fine, messages sent during the peak saw 10-30 second delays, and messages sent after the queue filled up either failed silently (the client showed "sending..." for 30 seconds and then gave up) or succeeded with a noticeable lag.
The inconsistency made the incident harder to triage. Support received reports like "Slack is slow" alongside reports like "Slack is working fine for me." That is the signature of a shard-specific database issue: only a subset of users are affected, and the subset corresponds to a database partition that is invisible to the user.
Downstream cascading. Message delivery delays also affected:
- Slack bots and integrations: Webhook deliveries backed up because bots waiting for database reads to construct responses timed out. CI/CD notifications from GitHub, Jira ticket updates, and PagerDuty alerts were all delayed.
- Thread loading: Opening a thread requires reading message history. Slow reads meant threads took 10+ seconds to load, and some timed out entirely.
- Search indexing: The search indexer, which reads new messages to index them, fell behind. Recently sent messages did not appear in search results for 30+ minutes.
- Unread counts: Badge counts and unread markers depend on DB reads. During the incident, unread counts were incorrect, showing stale values. Some users had channels marked "unread" with no new messages, or had new messages with no unread indicator.
Shard-specific failures look random to users
When only one database shard is affected, only the workspaces assigned to that shard experience degradation. To users, this looks random: "Why is it slow for my team but my friend at another company says it is fine?" The answer is that different workspaces live on different shards. This is important for interviews: sharded databases isolate failure to the affected shard, which is a benefit of sharding. But it also makes diagnosis harder because the symptoms are inconsistent.
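A small sketch makes the "looks random" effect concrete. It assumes a simple hash-based assignment of workspaces to shards; the source does not describe Slack's actual mapping, so `shard_for_workspace`, `NUM_SHARDS`, and the workspace IDs are all hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for_workspace(workspace_id, num_shards=NUM_SHARDS):
    """Map a workspace to a shard with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(workspace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_workspaces(workspace_ids, degraded_shard, num_shards=NUM_SHARDS):
    """Return only the workspaces that live on the degraded shard."""
    return [
        w for w in workspace_ids
        if shard_for_workspace(w, num_shards) == degraded_shard
    ]
```

Given a degraded shard, `affected_workspaces` returns only a slice of all workspaces. Nothing visible to users ties those workspaces together, which is why support tickets during such an incident look contradictory.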
Slack's Message Delivery Architecture
To understand why connection pool exhaustion cascades so quickly, you need to see where the pool sits in the message delivery path.
Each app server maintains a fixed-size connection pool (e.g., 100 connections) to each database shard it communicates with. Under normal conditions, the pool is partially utilized: maybe 20-30 connections active at any moment, with the rest idle and ready.
The critical path for sending a message is:
1. Client sends message via HTTPS
2. App server checks out a connection from the pool
3. App server executes an INSERT into the message table on the appropriate shard
4. App server returns the connection to the pool
5. App server enqueues a fanout job for real-time delivery
6. Client receives acknowledgment
Steps 2-4 typically take 2-5ms. The connection is borrowed for that tiny window and returned. At 2ms per message and 100 connections, a single pool can theoretically support 50,000 messages per second. In practice, some queries take longer, so real throughput is lower, but the point is: under normal conditions, pool utilization is low and everything flows.
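The borrow-and-return cycle in steps 2-4 can be sketched with a bounded pool built on the standard library's `queue.Queue`. This is an illustration under stated assumptions, not Slack's implementation: `BoundedPool`, `PoolTimeout`, `send_message`, and the 50ms wait budget are hypothetical names and values, and the connections are opaque objects.

```python
import queue
from contextlib import contextmanager


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the wait budget."""


class BoundedPool:
    def __init__(self, connections, wait_timeout_seconds=0.05):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self.wait_timeout_seconds = wait_timeout_seconds  # assumed budget

    @contextmanager
    def checkout(self):
        try:
            # Block for at most the wait budget, then fail fast.
            conn = self._idle.get(timeout=self.wait_timeout_seconds)
        except queue.Empty:
            raise PoolTimeout("pool exhausted") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


def send_message(pool, insert_message, enqueue_fanout, message):
    with pool.checkout() as conn:      # step 2: check out a connection
        insert_message(conn, message)  # step 3: INSERT on the shard
    # step 4: the connection went back to the pool on leaving the with-block
    enqueue_fanout(message)            # step 5: enqueue real-time fanout
    return "ack"                       # step 6: acknowledge to the client
```

The connection is held only for the duration of the INSERT. A short wait timeout turns an exhausted pool into a quick, visible error rather than an ever-growing queue of blocked requests.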
The critical insight for system design is Little's Law: the average number of connections in use equals the arrival rate multiplied by the average service time. If you receive 1,000 queries per second and each takes 2ms, you need 2 connections. If each takes 200ms (due to a slow query), you need 200 connections. The pool size does not change; only the query latency does. A 100x increase in query latency therefore causes a 100x increase in connection demand, even if traffic stays flat.
Little's Law for connection pools:
connections_in_use = arrival_rate * avg_query_time
Normal: 1000 q/s * 0.002s = 2 connections in use (2% of 100-connection pool)
Slow Q: 1000 q/s * 0.200s = 200 connections needed (200% of 100-connection pool → queue)
Spike: 5000 q/s * 0.002s = 10 connections in use (10% of pool, still fine)
Both: 5000 q/s * 0.200s = 1000 connections needed (1000% of pool → catastrophic queue)
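The "both" scenario (traffic spike plus slow queries) is what hit Slack. In these illustrative numbers the spike alone is harmless (10% of the pool), while the slow query alone already demands twice the pool; stacking the spike on top pushes demand to ten times the pool size. The combination was fatal.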
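The same arithmetic can be reproduced in a few lines of Python. This is a sketch that uses the illustrative numbers from the text (100-connection pool, 2ms vs 200ms queries, 1,000 vs 5,000 queries per second); the function and constant names are made up for the example.

```python
POOL_SIZE = 100  # connections per app server per shard, as in the example above


def connections_needed(arrival_rate_per_s, avg_query_seconds):
    """Little's Law: average connections in use = arrival rate * service time."""
    return arrival_rate_per_s * avg_query_seconds


def max_throughput_per_s(pool_size, avg_query_seconds):
    """Upper bound on the queries per second a pool can sustain at a given latency."""
    return pool_size / avg_query_seconds


SCENARIOS = {
    "normal": (1000, 0.002),
    "slow query": (1000, 0.200),
    "spike": (5000, 0.002),
    "both": (5000, 0.200),
}

for name, (rate, latency) in SCENARIOS.items():
    needed = connections_needed(rate, latency)
    verdict = "ok" if needed <= POOL_SIZE else "requests queue"
    print(f"{name:<10} needs {needed:6.0f} connections -> {verdict}")
```

Read the other way around, `max_throughput_per_s(100, 0.200)` comes out at roughly 500 queries per second. Once latency reaches 200ms, any arrival rate above that makes the queue grow without bound.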
Why fixed-size connection pools?
Database connections are expensive to create (TCP handshake, authentication, session setup) and expensive to maintain (each connection consumes memory and a thread on the MySQL server). Fixed-size pools bound the resource consumption on both sides: the app server limits how many connections it opens, and the database knows the maximum load it can receive from each client. Without pools, a traffic spike could open thousands of connections and overwhelm the database server itself.
Root Cause: Connection Pool Exhaustion
The root cause was a slow query on a hot database shard, combined with a traffic spike that exceeded the pool's capacity to absorb increased latency.
The slow query. A background job executed a JOIN that hit two large tables without a proper index. Under normal query load, this query would have been slow but not catastrophic (it might hold one connection for 30 seconds instead of the normal 2ms). But this query also held row-level locks that blocked subsequent INSERT operations on the same rows. Other messages destined for the same shard had their INSERTs delayed waiting for the lock to release.
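A query like this is what automated slow query termination is meant to catch. The sketch below is one hedged way to express the selection logic. It assumes the caller has already fetched rows shaped like MySQL's process list, as dictionaries keyed by the `ID`, `USER`, `COMMAND`, and `TIME` columns, and it only builds `KILL QUERY` statements without executing them. The 10-second threshold and the protected user name are illustrative assumptions.

```python
def kill_statements(processlist_rows, max_seconds=10, protected_users=("replication",)):
    """Return KILL QUERY statements for queries running longer than max_seconds."""
    statements = []
    for row in processlist_rows:
        if row["COMMAND"] != "Query":
            continue  # skip idle (Sleep) and other non-query threads
        if row["USER"] in protected_users:
            continue  # never kill accounts on the allowlist (assumed policy)
        if row["TIME"] >= max_seconds:
            # KILL QUERY aborts the statement but keeps the connection open.
            statements.append(f"KILL QUERY {int(row['ID'])}")
    return statements
```

Keeping the selection as a pure function makes the policy easy to test before it is ever pointed at a production database. In this timeline, a 10-second threshold would have flagged the 30-second JOIN well before an engineer found it by hand.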