Cache stampede anti-pattern
Understand how a mass cache invalidation event triggers a write thunderstorm, why it differs from thundering herd, and how stale-while-revalidate and background refresh prevent it.
TL;DR
- A cache stampede is a write thunderstorm triggered by cache invalidation, not expiry. Something updates the underlying data, invalidates a cache key, and every reader tries to repopulate it simultaneously.
- Unlike thundering herd (expiry-driven), stampedes are often event-driven: a single database write broadcasts a cache delete to every consumer, who all race to re-read the DB.
- The three defences are: stale-while-revalidate (serve stale content while refreshing in the background), write-through caching (the write path overwrites the cached value instead of deleting the key), and event-driven background refresh (the write path publishes a "refresh needed" event and a single worker handles it).
- Stampedes are especially dangerous on hot write paths: a high-traffic product price update can invalidate a key that 50,000 readers are actively using.
The Problem
Your product catalog caches every category page for 10 minutes. A merchandiser updates a product price. Your application invalidates the cache key for that category. This is correct behaviour (you don't want stale prices).
But the price update happened at 11:58 a.m. on Black Friday, when 50,000 users are concurrently browsing that category. The moment the cache delete executes, all 50,000 in-flight requests get a cache miss. Every one calls SELECT * FROM products WHERE category_id = ?. Your database connection pool exhausts in milliseconds.
The merchandiser just accidentally DDoSed your own database by updating one price.
I've seen this exact scenario take down a checkout flow during a holiday sale. The fix took 20 minutes to deploy, but the site was down for 8 minutes before anyone even diagnosed the root cause. Everyone assumed it was a traffic spike, not a cache invalidation.
What makes stampedes particularly cruel: they happen because you're doing the right thing (keeping data fresh). Engineers who skip cache invalidation never experience stampedes, but they serve stale data instead. The challenge is finding the middle ground, which is what the fix section covers.
Here's the timeline of what happens under the hood:
T+0ms: Merchandiser clicks "Save" on price update
T+5ms: DB write completes
T+6ms: Application calls redis.DEL("category:electronics")
T+7ms: 50,000 in-flight readers get MISS
T+8ms: 50,000 SELECT queries hit the database
T+10ms: Connection pool exhausted (max 200 connections)
T+15ms: DB starts rejecting new connections
T+500ms: First query completes, cache repopulated
T+500ms: 49,800 queries already failed or timed out
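The timeline can be reproduced at small scale. The following is a minimal sketch, not a benchmark: it assumes an in-memory dict standing in for Redis, a semaphore standing in for the DB connection pool, and scaled-down numbers (200 readers, 10 connections). The function name and the barriers are illustrative; the barriers force the worst case, where every reader checks the cache before any refill lands.

```python
import threading

KEY = "category:electronics"


def simulate_stampede(readers: int = 200, pool_size: int = 10) -> dict:
    """Model a hard delete of a hot key while `readers` requests are in flight."""
    cache = {KEY: "old page"}
    pool = threading.BoundedSemaphore(pool_size)   # stand-in for the DB connection pool
    all_checked = threading.Barrier(readers)       # every reader sees the miss first
    all_attempted = threading.Barrier(readers)     # connections are held while others arrive
    stats = {"misses": 0, "db_ok": 0, "db_rejected": 0}
    stats_lock = threading.Lock()

    del cache[KEY]  # invalidate-on-write: the key is simply gone

    def reader() -> None:
        missed = cache.get(KEY) is None
        all_checked.wait()
        # A miss sends the reader to the DB; no free connection means rejection.
        got_conn = missed and pool.acquire(blocking=False)
        all_attempted.wait()
        with stats_lock:
            stats["misses"] += int(missed)
            if got_conn:
                stats["db_ok"] += 1
            elif missed:
                stats["db_rejected"] += 1
        if got_conn:
            cache[KEY] = "fresh page"  # duplicate refill: same work done pool_size times
            pool.release()

    threads = [threading.Thread(target=reader) for _ in range(readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


if __name__ == "__main__":
    print(simulate_stampede())
```

Even the readers that do get a connection are doing redundant work: each one runs the same query and writes the same value back.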
Why It Happens
Stampedes emerge from the "invalidate on write" pattern, which sounds correct in isolation. Data changed, so the cache should reflect that. The problem isn't the invalidation itself; it's the gap it creates.
Hard delete creates a miss window. When you DEL a cache key, every concurrent reader instantly sees a miss. There's no grace period, no "serve stale while refreshing." The key is just gone.
Write timing is unpredictable. Unlike TTL expiry (which you can plan for), invalidation happens whenever someone writes. A merchandiser updating a price at peak traffic doesn't know they're about to trigger 50,000 DB queries.
Writes fan out to all readers. A single write event affects every reader of that cache key. If the key is popular, the amplification factor is enormous. One write becomes 50,000 reads.
The insidious part: every individual decision here is defensible. You should invalidate stale data. You should use cache-aside for simplicity. The problem only emerges under concurrent load, which is exactly when you can't afford it.
Stampede vs Thundering Herd
These are often confused, and the confusion costs teams real debugging time. The distinction matters because the mitigations differ.
I've been in incident calls where the on-call engineer diagnoses "thundering herd" and spends an hour adding mutex-on-miss, only to discover the root cause was an admin panel write that invalidated a hot key. The symptoms are identical (DB overload from duplicate reads), but mutex-on-miss doesn't help if the trigger is a DEL command, not a TTL expiry.
| | Thundering Herd | Cache Stampede |
|---|---|---|
| Trigger | TTL expiry (time-based) | Cache invalidation (event-based) |
| Cause | Key expires while under concurrent load | A write deletes a key while readers are active |
| Predictable timing | Somewhat (you know the TTL) | No, triggered by arbitrary writes |
| Best fix | Probabilistic early expiry, mutex-on-miss | Stale-while-revalidate, write-through |
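To make the stale-while-revalidate row concrete, here is a minimal sketch under stated assumptions: an in-process cache rather than Redis, a hypothetical `loader` callback standing in for the DB query, and one background thread per refresh. The class and method names are illustrative, not taken from any library. The key idea is that invalidation marks the entry stale instead of deleting it, so readers keep getting the old value while exactly one refresh runs.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class _Entry:
    value: Any
    version: int = 0          # bumped on every invalidation
    stale: bool = False
    refreshing: bool = False


class StaleWhileRevalidateCache:
    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # hypothetical: reads the source of truth
        self._lock = threading.Lock()
        self._entries: Dict[str, _Entry] = {}
        self._workers: List[threading.Thread] = []

    def get(self, key: str) -> Any:
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                # Cold miss: load inline under the lock (simple, but blocks other keys).
                entry = _Entry(self._loader(key))
                self._entries[key] = entry
            elif entry.stale and not entry.refreshing:
                # First reader after an invalidation starts the only refresh.
                entry.refreshing = True
                worker = threading.Thread(target=self._refresh, args=(key,), daemon=True)
                self._workers.append(worker)
                worker.start()
            return entry.value  # possibly stale, never a miss for a known key

    def invalidate(self, key: str) -> None:
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                entry.stale = True
                entry.version += 1

    def _refresh(self, key: str) -> None:
        with self._lock:
            seen = self._entries[key].version
        value = self._loader(key)  # slow call happens outside the lock
        with self._lock:
            entry = self._entries[key]
            entry.value = value
            entry.refreshing = False
            # If another write landed mid-refresh, stay stale so the next read retries.
            entry.stale = entry.version != seen

    def wait_for_refreshes(self) -> None:
        """Helper for demos and tests: block until started refreshes finish."""
        with self._lock:
            workers, self._workers = self._workers, []
        for worker in workers:
            worker.join()
```

The trade-off is explicit: readers may see the old price for as long as one refresh takes, in exchange for the DB seeing one query per invalidation instead of one per reader.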
How to Detect It
Stampedes look like thundering herds in their symptoms but differ in one key way: the timing is irregular. There's no TTL-interval sawtooth pattern.
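One rough way to check for the missing sawtooth is to look at how evenly spaced the miss spikes are. The sketch below is a heuristic under assumptions, not an established detection method: it assumes you already have timestamps (in seconds) of cache-miss spikes from your metrics, and the function name and the 0.2 threshold are arbitrary illustrative choices. Evenly spaced spikes point toward TTL expiry; irregular gaps are consistent with write-driven invalidation.

```python
from statistics import mean, pstdev
from typing import List


def classify_spike_timing(spike_times_s: List[float], max_cv: float = 0.2) -> str:
    """Label miss spikes as 'periodic' (TTL-like) or 'irregular' (write-driven).

    Uses the coefficient of variation (stddev / mean) of the gaps between spikes.
    `max_cv` is an assumed cutoff; tune it against your own traffic.
    """
    if len(spike_times_s) < 3:
        return "insufficient data"
    ordered = sorted(spike_times_s)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return "insufficient data"
    cv = pstdev(gaps) / average_gap
    return "periodic" if cv <= max_cv else "irregular"


if __name__ == "__main__":
    # Spikes every 600 s line up with a 10-minute TTL.
    print(classify_spike_timing([0, 600, 1200, 1800]))
    # Unevenly spaced spikes suggest an event-driven trigger.
    print(classify_spike_timing([0, 130, 900, 2400]))
```

An "irregular" result is only a hint. Correlating spike times with write or cache-delete events in your logs is the stronger confirmation.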