Cache stampede anti-pattern
Understand how a mass cache invalidation event triggers a write thunderstorm, why it differs from thundering herd, and how stale-while-revalidate and background refresh prevent it.
TL;DR
- A cache stampede is a write thunderstorm triggered by cache invalidation, not expiry. Something updates the underlying data, invalidates a cache key, and every reader tries to repopulate it simultaneously.
- Unlike thundering herd (expiry-driven), stampedes are often event-driven: a single database write broadcasts a cache delete to every consumer, who all race to re-read the DB.
- The three defences are: stale-while-revalidate (serve stale content while refreshing in background), write-through cache (update cache before invalidating), and event-driven background refresh (the write path publishes a "refresh needed" event, a single worker handles it).
- Stampedes are especially dangerous on hot write paths: a high-traffic product price update can invalidate a key that 50,000 readers are actively using.
The Problem
Your product catalog caches every category page for 10 minutes. A merchandiser updates a product price. Your application invalidates the cache key for that category. This is correct behaviour (you don't want stale prices).
But the price update happened at 11:58 a.m. on Black Friday, when 50,000 users are concurrently browsing that category. The moment the cache delete executes, all 50,000 in-flight requests get a cache miss. Every one calls SELECT * FROM products WHERE category_id = ?. Your database connection pool exhausts in milliseconds.
The merchandiser just accidentally DDoSed your database by updating one price.
I've seen this exact scenario take down a checkout flow during a holiday sale. The fix took 20 minutes to deploy, but the site was down for 8 minutes before anyone even diagnosed the root cause. Everyone assumed it was a traffic spike, not a cache invalidation.
What makes stampedes particularly cruel: they happen because you're doing the right thing (keeping data fresh). Engineers who skip cache invalidation never experience stampedes, but they serve stale data instead. The challenge is finding the middle ground, which is what the fix section covers.
Here's the timeline of what happens under the hood:
```
T+0ms:   Merchandiser clicks "Save" on price update
T+5ms:   DB write completes
T+6ms:   Application calls redis.DEL("category:electronics")
T+7ms:   50,000 in-flight readers get MISS
T+8ms:   50,000 SELECT queries hit the database
T+10ms:  Connection pool exhausted (max 200 connections)
T+15ms:  DB starts rejecting new connections
T+500ms: First query completes, cache repopulated
T+500ms: 49,800 queries already failed or timed out
```
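The amplification in that timeline can be reproduced with a minimal in-memory sketch. The `Map` stands in for Redis and `dbQueryCount` tallies database load; all names here are illustrative, not a real client API:

```typescript
// In-memory stand-ins: `cache` plays Redis, `dbQueryCount` tallies DB load.
const cache = new Map<string, string>();
let dbQueryCount = 0;

async function dbFetch(key: string): Promise<string> {
  dbQueryCount++; // every cache miss costs one DB query
  return `rows-for-${key}`;
}

async function readCategory(key: string): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // cache hit: no DB work
  const value = await dbFetch(key);  // miss: each caller independently hits the DB
  cache.set(key, value);
  return value;
}

async function simulateStampede(readers: number): Promise<number> {
  cache.set("category:electronics", "cached-rows");
  dbQueryCount = 0;
  cache.delete("category:electronics"); // the write path's hard DEL
  // Every in-flight reader now misses at once and races to the database.
  await Promise.all(
    Array.from({ length: readers }, () => readCategory("category:electronics"))
  );
  return dbQueryCount;
}
```

Run with 500 concurrent readers and the counter reports 500 queries for a single invalidation; the fixes below collapse that burst to one.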
Why It Happens
Stampedes emerge from the "invalidate on write" pattern, which sounds correct in isolation. Data changed, so the cache should reflect that. The problem isn't the invalidation itself; it's the gap it creates.
Hard delete creates a miss window. When you DEL a cache key, every concurrent reader instantly sees a miss. There's no grace period, no "serve stale while refreshing." The key is just gone.
Write timing is unpredictable. Unlike TTL expiry (which you can plan for), invalidation happens whenever someone writes. A merchandiser updating a price at peak traffic doesn't know they're about to trigger 50,000 DB queries.
Writes fan out to all readers. A single write event affects every reader of that cache key. If the key is popular, the amplification factor is enormous. One write becomes 50,000 reads.
The insidious part: every individual decision here is defensible. You should invalidate stale data. You should use cache-aside for simplicity. The problem only emerges under concurrent load, which is exactly when you can't afford it.
Stampede vs Thundering Herd
These are often confused, and the confusion costs teams real debugging time. The distinction matters because the mitigations differ.
I've been in incident calls where the on-call engineer diagnoses "thundering herd" and spends an hour adding mutex-on-miss, only to discover the root cause was an admin panel write that invalidated a hot key. The symptoms are identical (DB overload from duplicate reads), but mutex-on-miss doesn't help if the trigger is a DEL command, not a TTL expiry.
| | Thundering Herd | Cache Stampede |
|---|---|---|
| Trigger | TTL expiry (time-based) | Cache invalidation (event-based) |
| Cause | Key expires while under concurrent load | A write deletes a key while readers are active |
| Predictable timing | Somewhat (you know the TTL) | No, triggered by arbitrary writes |
| Best fix | Probabilistic early expiry, mutex-on-miss | Stale-while-revalidate, write-through |
How to Detect It
Stampedes look like thundering herds in their symptoms but differ in one key way: the timing is irregular. There's no TTL-interval sawtooth pattern.
| Symptom | What It Means | How to Check |
|---|---|---|
| DB spike immediately after a write operation | Write-triggered invalidation stampede | Correlate DB CPU spikes with your write/update audit log |
| Cache DEL commands spike alongside DB reads | Hard invalidation causing mass misses | Monitor Redis DEL commands per second alongside keyspace_misses |
| Irregular DB spikes (no periodic pattern) | Event-driven, not TTL-driven | If spikes don't repeat on a fixed interval, it's a stampede, not a herd |
| Write API latency is fine but read API latency spikes | Reads are the victims, not the writes | Compare P99 latency between your write and read endpoints |
| pg_stat_activity shows many identical SELECT queries after an UPDATE | All readers racing to repopulate | Query pg_stat_activity during a spike, filter by query text |
The distinguishing signal: if DB spikes correlate with cache DEL events rather than TTL expirations, you have a stampede. Check your cache invalidation logs.
One quick diagnostic I run on every new system: enable Redis MONITOR for 60 seconds during a known write event, and count the DEL commands followed by GET miss patterns. If you see a DEL immediately followed by hundreds of identical GET misses, that's your stampede.
```bash
# Capture Redis commands during a write event
redis-cli MONITOR | grep -E "DEL|GET" | head -500

# What you're looking for:
# 1617981234.567890 "DEL" "category:electronics" <- the invalidation
# 1617981234.568001 "GET" "category:electronics" <- miss #1
# 1617981234.568002 "GET" "category:electronics" <- miss #2
# 1617981234.568003 "GET" "category:electronics" <- miss #3...
# (hundreds of GETs in the same millisecond)
```
Quick distinction
Thundering herd: sawtooth pattern, periodic spikes matching TTL. Cache stampede: irregular spikes correlating with write events. If you graph both DB CPU and cache DEL commands on the same timeline, stampede shows clear cause-and-effect alignment.
The Fix
Fix 1: Stale-while-revalidate
Serve the slightly stale cached value while asynchronously fetching a fresh one. The first reader who detects the staleness triggers a background refresh; every other reader gets the stale value immediately.
The approach requires changing your invalidation strategy. Instead of deleting the key, you mark it as stale. The key stays in cache (so readers still get a hit), but the metadata signals "this data needs refreshing."
```typescript
// `cache` and `fetchAndCache` are assumed helpers; `Value` is your cached type.
const refreshInProgress = new Set<string>();

async function getWithStaleWhileRevalidate(key: string): Promise<Value> {
  const entry = await cache.get(key);
  if (!entry) {
    // True cold miss: fetch synchronously
    return await fetchAndCache(key);
  }
  if (entry.isStale && !refreshInProgress.has(key)) {
    // Serve stale, trigger background refresh
    refreshInProgress.add(key);
    fetchAndCache(key).finally(() => refreshInProgress.delete(key));
  }
  return entry.value; // return stale immediately
}
```
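The write path's side of this pattern replaces the DEL with a mark-as-stale update. A minimal sketch, assuming the cache entry carries its own `isStale` flag (`store` and `markStale` are illustrative names, not a real client API):

```typescript
interface Entry {
  value: string;
  isStale: boolean;
}

// In-memory stand-in for the cache the read path consults.
const store = new Map<string, Entry>();

// Write path: instead of store.delete(key), flip the staleness flag.
// The key stays present, so readers keep getting a hit; the first reader
// to see isStale triggers the background refresh.
function markStale(key: string): void {
  const entry = store.get(key);
  if (entry) entry.isStale = true; // still readable, just flagged
}

store.set("category:electronics", { value: "cached-rows", isStale: false });
markStale("category:electronics"); // the "invalidation" that causes no miss
```

The entry is still readable after `markStale`, so the 50,000 in-flight readers from the earlier example see a (stale) hit instead of a miss.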
HTTP also supports this natively: Cache-Control: max-age=60, stale-while-revalidate=300.
Trade-off: You're serving slightly stale data during the refresh window (typically a few hundred milliseconds). For most read-heavy applications, this trade-off is excellent. Users see a response instantly (the stale value), and freshness catches up asynchronously.
Implementation note: The refreshInProgress set above is in-process only. If you have multiple application instances, each one might trigger a refresh independently. That's still far better than 50,000 simultaneous misses, but for perfect deduplication across instances, you'd need a distributed flag in Redis (similar to mutex-on-miss from the thundering herd article).
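Cross-instance deduplication can be sketched with an atomic set-if-absent flag. The `setNx` function below is an in-memory stand-in for Redis `SET key value NX EX ttl`; the lock key and TTL are illustrative:

```typescript
// In-memory stand-in for Redis SET key value NX EX ttl:
// returns true only for the caller that claims the flag first.
const flags = new Map<string, number>();
function setNx(key: string, ttlMs: number): boolean {
  const now = Date.now();
  const expiry = flags.get(key);
  if (expiry !== undefined && expiry > now) return false; // flag already held
  flags.set(key, now + ttlMs);
  return true;
}

let refreshCount = 0;
async function refreshIfFirst(key: string): Promise<void> {
  // Only the instance that wins the flag performs the DB fetch.
  if (!setNx(`refresh-lock:${key}`, 5000)) return;
  refreshCount++; // stands in for fetchAndCache(key)
}

async function demo(): Promise<number> {
  // Simulate 10 app instances noticing the same stale key at once.
  await Promise.all(
    Array.from({ length: 10 }, () => refreshIfFirst("category:electronics"))
  );
  return refreshCount;
}
```

With a real Redis, the flag's TTL doubles as a deadlock guard: if the refreshing instance dies mid-fetch, the lock expires and another instance takes over.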
Fix 2: Write-through cache (update, don't delete)
Instead of deleting the cache key on a write, overwrite it with the fresh value. The cache never has a miss window. The write path becomes: write to DB, then write the updated value to cache.
This is the most intuitive fix once you understand the problem. The question shifts from "how do I handle the miss storm?" to "how do I avoid the miss entirely?" The answer: don't create a miss. Replace the value in-place.
```typescript
async function updatePrice(productId: string, newPrice: number): Promise<void> {
  await db.query("UPDATE products SET price = ? WHERE id = ?", [newPrice, productId]);
  // Write through: don't delete, replace
  const updated = await db.query("SELECT * FROM products WHERE id = ?", [productId]);
  await cache.set(`product:${productId}`, updated, { ttl: 600 });
}
```
Trade-off: The write path now has a cache dependency. If the cache write fails after the DB write, you have inconsistent data. Use a transaction wrapper or accept eventual consistency with a short TTL safety net.
Pro tip: Always set a TTL even on write-through keys. If the write-through fails silently (network issue, Redis memory full), the TTL ensures the key eventually expires and gets refreshed. I've seen production systems serve month-old data because a write-through failed once and nobody noticed.
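One way to honour both notes above is a wrapper that degrades a failed write-through into a hard delete, so the worst case is a brief miss window rather than long-lived stale data. A sketch; the `CacheClient` shape and function name are assumptions, not a real library API:

```typescript
interface CacheClient {
  set(key: string, value: unknown, opts: { ttl: number }): Promise<void>;
  del(key: string): Promise<void>;
}

// Try write-through first; on cache failure, fall back to deleting the key.
// Readers may then stampede briefly, but they never see the old value.
async function writeThroughOrDelete(
  cache: CacheClient,
  key: string,
  value: unknown
): Promise<void> {
  try {
    await cache.set(key, value, { ttl: 600 }); // the TTL is the safety net
  } catch {
    await cache.del(key); // degrade to invalidation rather than stay stale
  }
}
```

If even the fallback delete fails, the TTL from the last successful write still bounds how long the stale value can live.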
Fix 3: Event-driven background refresh
The write path publishes a "data changed" event. A dedicated refresh worker consumes the event and updates the cache. Readers always see cached data; freshness is eventual but the stampede is eliminated.
```
Write path:   DB write → publish event("product.updated", { productId })
Cache worker: consume event → fetch from DB → update cache
Readers:      always read from cache (may be slightly stale)
```
Here's a more concrete implementation using a message queue:
```typescript
// Write path: publish event instead of deleting cache
async function updatePrice(productId: string, newPrice: number): Promise<void> {
  await db.query("UPDATE products SET price = ? WHERE id = ?", [newPrice, productId]);
  await messageQueue.publish("cache.refresh", {
    key: `product:${productId}`,
    source: "products",
    id: productId,
  });
}

// Dedicated cache refresh worker (single instance)
messageQueue.subscribe("cache.refresh", async (event) => {
  const freshData = await db.query(
    "SELECT * FROM products WHERE id = ?",
    [event.id]
  );
  await cache.set(event.key, freshData, { ttl: 600 });
});
```
This is the cleanest solution for high-write systems but requires the infrastructure (a message queue, a worker) and accepting eventual consistency.
Trade-off: Readers may see data that's a few hundred milliseconds stale (the time between the DB write and the cache worker processing the event). For most read-heavy systems, this is a great trade. For systems where readers must never see stale data (financial balances, inventory counts), combine this with a short TTL backstop.
Scaling note: The cache refresh worker should be a single consumer (or use consumer groups with deduplication) to avoid multiple workers racing to refresh the same key. If the worker itself becomes a bottleneck, use a queue with priority: hot keys (high read rate) get refreshed first, cold keys can wait.
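The deduplication half of that note can be sketched in-process: coalesce refresh events for the same key so a burst of 100 writes to one product triggers one refresh, not 100 (the function names are illustrative):

```typescript
// Keys with a refresh already queued; duplicates are dropped until it runs.
const pendingRefresh = new Set<string>();

function enqueueRefresh(
  key: string,
  doRefresh: (key: string) => Promise<void>
): void {
  if (pendingRefresh.has(key)) return; // coalesce: a refresh is already queued
  pendingRefresh.add(key);
  queueMicrotask(() => {
    // Clear the flag before refreshing so a write arriving mid-refresh
    // queues a fresh pass instead of being silently lost.
    pendingRefresh.delete(key);
    void doRefresh(key);
  });
}
```

The same idea works across worker instances if the pending-set lives in Redis instead of process memory.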
How Write-Through Eliminates the Miss Window
The key difference: instead of delete-then-miss-then-refill, you get update-in-place with zero gap.
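The difference is visible even in a toy model; this sketch counts the misses a reader sees under each invalidation style (the `Map` stands in for the cache):

```typescript
const kv = new Map<string, string>();
let missCount = 0;

function readPrice(key: string): string | undefined {
  const v = kv.get(key);
  if (v === undefined) missCount++; // a miss means a trip to the database
  return v;
}

kv.set("product:42", "price=10");

// Style 1: invalidate by delete -> the next reader misses.
kv.delete("product:42");
readPrice("product:42");

// Style 2: write-through -> replace in place, the reader never misses.
kv.set("product:42", "price=12");
readPrice("product:42");
```

One miss from the delete, zero from the update-in-place; scale the reader count up and the delete's miss window becomes the stampede.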
Choosing Between Fixes
Use this rough decision guide to pick the right mitigation for your situation:
- Readers can tolerate brief staleness and you want the smallest change: stale-while-revalidate.
- The write path can safely touch the cache and you want zero miss window: write-through.
- High write volume and you already run a message queue: event-driven background refresh.
Severity and Blast Radius
Cache stampedes are high-severity because they're unpredictable and correlated with peak activity.
- Blast radius: All read traffic for the affected cache key. If the key is a category page or a product listing, that could be your entire storefront.
- Cascade risk: Very high. Unlike thundering herd (which is periodic), stampedes hit when real data changes happen, often during the busiest periods (price updates during sales, inventory changes during launches).
- Recovery time: 1-5 minutes if you catch it quickly. The cache repopulates once the DB stabilizes. But if the write that caused the stampede keeps happening (batch price updates), you get repeated stampedes.
- Amplification factor: A single write can trigger N simultaneous reads, where N is the number of concurrent users of that cache key. For popular keys, N can be 10,000-100,000.
- Compounding writes: In systems where multiple writes happen in bursts (batch updates, bulk imports), each write triggers its own stampede. 100 price updates in 1 second means 100 cache invalidations, each potentially causing thousands of DB reads. The total load isn't additive, it's multiplicative.
- Silent until it isn't: Like thundering herd, stampedes can lurk in your system for months. At low traffic, the miss-and-refill cycle completes before the next reader arrives. It only becomes a stampede when traffic crosses a threshold, which usually coincides with your most important business moments.
When It's Actually OK
- Low-read keys: If a cache key serves under 100 QPS, the "stampede" is a handful of DB queries. Not worth the complexity of write-through.
- Infrequent writes: If the underlying data changes once a day, the stampede is a rare event. A momentary DB spike once daily is often acceptable.
- Writes happen during off-peak hours: If your batch update job runs at 3 a.m. when traffic is 1% of peak, the miss window affects almost nobody.
- Short recompute time with DB headroom: If the query takes 5ms and your DB can handle 10x normal load, the stampede resolves before users notice.
The key question to ask: "If this cache key disappears right now, during our highest traffic hour, would the database survive?" If the answer is yes, you can live with hard deletes. If the answer is "I'm not sure," add write-through.
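That gut check can be written down as a deliberately crude heuristic; treat the threshold as an assumption rather than a rule, since real pools queue, retry, and time out:

```typescript
// Crude capacity check: if a hard delete turns every concurrent reader of a
// key into a DB query, the burst must at least fit the connection pool.
function survivesHardDelete(concurrentReaders: number, poolSize: number): boolean {
  return concurrentReaders <= poolSize;
}

// The article's Black Friday numbers: 50,000 readers vs a 200-connection pool.
const blackFriday = survivesHardDelete(50_000, 200); // false: add write-through
const quietKey = survivesHardDelete(80, 200);        // true: hard delete is fine
```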
I worked on a system where we tracked every cache invalidation in a spreadsheet for two weeks. The result: 3 out of 200 cache keys accounted for 95% of the stampede risk. We added write-through only to those 3 keys and left the rest with hard deletes. Targeted fixes beat blanket complexity.
How This Shows Up in Interviews
Interviewers test stampede awareness when your design has both a cache and a write path that can invalidate it. When you mention "invalidate on write," follow immediately with "and to avoid a stampede, I'd use write-through or stale-while-revalidate rather than a hard delete."
At senior level, you should explain why DEL is dangerous under concurrency and propose write-through as the default. At staff level, discuss the trade-offs between write-through (write path complexity), stale-while-revalidate (brief staleness), and event-driven refresh (infrastructure cost). Mention that the choice depends on your consistency requirements and write frequency.
A strong answer distinguishes stampede from thundering herd by trigger type: "Thundering herd is time-triggered (TTL expiry), stampede is event-triggered (write invalidation). The fixes overlap but the root causes are different."
The safe default
When in doubt, use stale-while-revalidate. It keeps your cache hit rate high, eliminates the miss window, and the slight staleness is acceptable in most read-heavy systems.
Quick Recap
- Stampedes are event-driven: a write invalidates a popular cache key while thousands of readers are active.
- The result is a mass simultaneous cache miss, identical to thundering herd in impact but different in trigger.
- Hard-delete-on-write is the root cause pattern. Replace it with write-through or stale-while-revalidate.
- For high-write, high-read systems, event-driven background refresh fully decouples the write and read paths.
- Always ask in a design: "What happens to this cache key when the underlying data changes?"
- The simplest rule: never DEL a hot cache key under load. Update it, mark it stale, or let a background worker refresh it.
- Combine write-through with a safety-net TTL for belt-and-suspenders reliability. If the write-through fails, the TTL catches it.