Debugging latency spikes
P99 latency spikes in production almost always trace to a small set of causes: GC pauses, database lock contention, slow query plans, connection pool exhaustion, or downstream dependencies. Here's the systematic playbook.
The Problem Statement
Interviewer: "Your P99 latency just spiked from 120ms to 2 seconds in production. The P50 is still at 80ms. You have access to metrics, logs, and distributed traces. Walk me through exactly how you would figure out what caused this."
This question tests three things simultaneously. First, whether you understand why percentile-based metrics matter (the P50 vs P99 divergence is a clue, not just context). Second, whether you have a systematic triage process or whether you poke at random dashboards hoping for luck. Third, whether you know the actual technical causes of latency spikes and can reason about each one.
The follow-up questions get harder: "What if the traces show the spike is inside the database? What if the GC looks fine but latency is still high? What if you cannot reproduce it?"
Clarifying the Scenario
You: "Before I start diagnosing, I want to make sure I am looking at this correctly. A few quick questions."
You: "When you say P99 spiked to 2 seconds, is this per-endpoint or across all endpoints? A single slow endpoint can inflate the P99 for the whole service."
Interviewer: "Good catch. It is one specific endpoint, /api/search."
You: "Is the spike continuous, or is it periodic? Like every 30 seconds or every few minutes a batch of requests are slow?"
Interviewer: "It looks periodic. Every 2-3 minutes, a batch of requests spikes."
You: "That pattern is very helpful. Periodic spikes with a normal P50 almost always point to GC pauses, scheduled jobs, or cache TTL expirations. I will walk through the full investigation tree, but I would front-load my attention on those causes. Is this a JVM-based service?"
Interviewer: "Yes, Java/Spring Boot."
You: "Perfect. I will structure my answer in five layers: application-level causes, cache behavior, database contention, external dependencies, and infrastructure constraints like CPU throttling. The periodic pattern in a JVM service strongly implicates GC, so I will start there and show how to confirm or rule it out quickly. One more thing: do you have GC logging enabled?"
Interviewer: "Good question. Not currently."
You: "Noted. I will cover how to add that as part of the diagnostic. Without GC logs you are guessing, and adding them takes one JVM flag and a redeploy."
My Approach
I think about latency investigations as a five-layer triage, moving from the nearest cause to the farthest.
- Application layer: GC pauses, thread pool exhaustion, blocking I/O on the hot path, connection pool starvation.
- Cache layer: Cache stampedes on TTL expiry, cold cache after deployment, cache server resource pressure.
- Database layer: Slow queries, lock contention, connection pool exhaustion at the DB, autovacuum interference (Postgres).
- External dependencies: Slow third-party APIs, DNS resolution latency, TLS renegotiation overhead.
- Infrastructure layer: CPU throttling from cgroup limits, noisy neighbors on shared hardware, network packet loss causing TCP retransmits.
The P50 being normal while P99 spikes is critical context. It tells me that most requests are fine and only a fraction are affected. This largely rules out anything affecting all requests uniformly, like a dropped database index or a bad deployment. The periodic pattern in a JVM service is almost textbook GC.
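One way to check the "periodic" claim with data rather than by eyeballing a dashboard is to group slow-request timestamps into bursts and look at the spacing between them. The sketch below is illustrative only: the function names, the 10-second burst gap, the 25% tolerance, and the sample timestamps are all assumptions, not values from a real system.

```python
from statistics import median

def burst_intervals(slow_timestamps, gap_seconds=10.0):
    """Group slow-request timestamps into bursts; return seconds between burst starts.

    A new burst starts whenever the gap since the previous slow request
    exceeds gap_seconds (an assumed threshold).
    """
    starts = []
    previous = None
    for ts in sorted(slow_timestamps):
        if previous is None or ts - previous > gap_seconds:
            starts.append(ts)
        previous = ts
    return [b - a for a, b in zip(starts, starts[1:])]

def looks_periodic(intervals, tolerance=0.25):
    """True when there are at least three intervals, all within +/- tolerance of their median."""
    if len(intervals) < 3:
        return False
    mid = median(intervals)
    return all(abs(i - mid) <= tolerance * mid for i in intervals)

# Hypothetical timestamps (in seconds) of requests slower than 1 second.
slow = [0.0, 0.4, 1.1, 150.2, 150.9, 298.7, 299.0, 451.3]
intervals = burst_intervals(slow)
periodic = looks_periodic(intervals)
```

With this sample the bursts start roughly 150 seconds apart, which matches the "every 2-3 minutes" pattern from the scenario. Regular spacing supports the GC, scheduled job, and cache TTL hypotheses; irregular spacing points elsewhere.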
The investigation time budget matters. In a production incident, I want to rule out the cheapest-to-check causes first, each in under 5 minutes. Reading GC logs and checking pg_stat_activity take about 2 minutes each. Checking CPU throttling via cgroup metrics takes about 1 minute. If none of those pan out, I move to distributed traces and start looking at individual slow spans. I do not jump to a full heap analysis or a network packet capture in the first 30 minutes.
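The cgroup check can be as small as sampling the cpu.stat file twice and comparing the counters. A minimal sketch, assuming the nr_periods and nr_throttled counters that the Linux CPU controller exposes in cpu.stat; the file location varies by cgroup version and container runtime, so the sample contents below are inlined as hypothetical strings rather than read from disk.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def throttled_fraction(before, after):
    """Fraction of scheduler enforcement periods that were throttled between two samples."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0

# Hypothetical samples taken one minute apart (values are made up).
sample_before = """usage_usec 1000000
nr_periods 1000
nr_throttled 50
throttled_usec 400000"""

sample_after = """usage_usec 1900000
nr_periods 1600
nr_throttled 230
throttled_usec 2100000"""

fraction = throttled_fraction(parse_cpu_stat(sample_before), parse_cpu_stat(sample_after))
```

Here 180 of 600 periods were throttled, so the container hit its CPU limit in 30% of periods during the window. A fraction near zero lets me cross throttling off the list quickly.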
Understanding the Percentile Landscape
The choice of which percentile to alert on is not arbitrary. P50 (median) tells you about the typical user. P95 tells you about the unlucky user. P99 tells you about the users most likely to complain. P99.9 tells you about the users most likely to churn.
For a service processing 1,000 requests per second, P99 means 10 requests per second experience the bad latency or worse. P99.9 means 1 request per second. Whether you care about P99 or P99.9 depends on your traffic volume and SLO. For a high-traffic consumer service, P99.9 is absolutely worth monitoring and alerting on.
The key insight is that P99 latency blends two distributions: a "fast" distribution (the overwhelming majority of requests) and a "slow" tail (the few requests that hit whatever the blocking condition is). When P50 is low and P99 is high, you are looking at a bimodal distribution. The investigation is about understanding what causes a request to fall into the slow distribution.
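The two-distribution picture is easy to reproduce with made-up numbers. The sketch below uses a hypothetical sample of 1,000 requests, 985 on the fast path and 15 on the slow path, with a simple nearest-rank percentile. The 80ms and 2000ms values echo the scenario; the 985/15 split is an assumption chosen for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 985 fast requests and 15 that hit a blocking condition.
latencies_ms = [80] * 985 + [2000] * 15

p50 = percentile(latencies_ms, 50)               # lands in the fast mode
p99 = percentile(latencies_ms, 99)               # lands in the slow mode
mean = sum(latencies_ms) / len(latencies_ms)     # blends both modes
```

In this sample P50 stays at 80ms, P99 jumps to 2000ms, and the mean sits at about 109ms, a value that no single request actually experienced. That is why the median hides the tail and the average misrepresents both modes.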
The Architecture
Here is the full investigation decision tree. Each node represents a diagnostic check and each arrow represents what you do depending on what you find.
The order in this decision tree is intentional. I start with GC and thread pools because they take 2 minutes to rule out and explain a large fraction of production latency incidents in JVM services. Infrastructure (CPU throttling, noisy neighbor) comes last because it requires more command-line work and explains a smaller fraction of cases. The tree reflects empirical frequency across incidents, not theoretical completeness.
The tree looks complex but the investigation itself follows a natural sequence. I start at the top (application layer) because it is cheapest to check and most likely to be the cause in JVM services. I move outward only when I have ruled out the inner layers.
The P50 normal / P99 spiking pattern is my first filter. If P50 were also spiking, I would skip directly to infrastructure and database, because that suggests something affecting all requests. A healthy P50 means the problem is in the tail of the distribution, which means it affects only some requests. That screams lock contention, GC pauses, or thread pool limits.
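This first filter can be written down as a small rule. The sketch below is a simplification: the function name, the labels, and the 1.5x "moved" threshold are assumptions for illustration, and a real check would compare windows of equal length with enough samples in each.

```python
UNIFORM_SHIFT = "uniform shift"   # start with infrastructure and database
TAIL_ONLY = "tail only"           # start with GC, lock contention, thread pools
NO_CHANGE = "no significant change"

def first_filter(p50_before, p50_now, p99_before, p99_now, ratio=1.5):
    """Classify a latency change by which percentiles grew by more than `ratio`."""
    p50_moved = p50_now > ratio * p50_before
    p99_moved = p99_now > ratio * p99_before
    if p50_moved:
        # The median moved, so the typical request is affected.
        return UNIFORM_SHIFT
    if p99_moved:
        # Only the tail moved, so a subset of requests is affected.
        return TAIL_ONLY
    return NO_CHANGE

# The scenario from the interview: P50 steady at 80ms, P99 from 120ms to 2000ms.
scenario = first_filter(p50_before=80, p50_now=80, p99_before=120, p99_now=2000)
```

For the interview scenario this returns "tail only", which is what sends the investigation to the application layer first.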
There is one rule that saves hours: before starting any investigation, confirm the spike happened and is not an artifact. Check if the alerting window changed, if a deployment just happened (which resets all percentile windows), or if a traffic pattern change is skewing the distribution. A P99 spike caused by a legitimate 10x traffic increase looks identical to a P99 spike from GC at first glance.
Average latency is useless for diagnosing spikes. A service with a 50ms average and a 2000ms P99 means 1 in 100 requests is at least 40x slower than average. At 1,000 requests per second, that is 10 requests per second waiting 2 seconds or longer. Always alert on P99 or P99.9, never on average.
Distinguishing GC Pauses from CPU Starvation
This is the single most common misdiagnosis I have seen in production systems. GC pauses look identical to CPU starvation in high-level metrics. Both manifest as periodic latency spikes with normal latency between spikes. The fix for each is completely different, so getting this wrong wastes hours.
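The critical distinction is the safepoint mechanism. When the JVM initiates a stop-the-world GC pause, it must bring all application threads to a safepoint before it can safely run the collector. A thread in a tight JIT-compiled loop that contains no safepoint polls (a long counted loop, for example) may take 50-150ms to reach the next safepoint check. Threads executing native code do not delay the safepoint; they are only held if they try to return to Java code while it is in progress. During that window, every thread that has already stopped makes no progress, but the GC has not even started yet.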
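A quick way to separate the two hypotheses is to check how many slow requests overlap a recorded stop-the-world pause. The sketch below assumes the pause start times and durations have already been extracted from GC or safepoint logs, and that slow-request windows come from traces. The function name and all sample values are hypothetical.

```python
def fraction_in_pauses(slow_requests, pauses):
    """Fraction of slow requests whose time window overlaps any stop-the-world pause.

    slow_requests: list of (start_s, end_s) tuples, e.g. from traces.
    pauses: list of (pause_start_s, pause_duration_s) tuples, e.g. from GC logs.
    """
    if not slow_requests:
        return 0.0
    windows = [(start, start + duration) for start, duration in pauses]
    hits = 0
    for req_start, req_end in slow_requests:
        # Two intervals overlap when each one starts before the other ends.
        if any(req_start <= p_end and p_start <= req_end for p_start, p_end in windows):
            hits += 1
    return hits / len(slow_requests)

# Hypothetical data: two long pauses and four slow requests.
pauses = [(150.0, 1.8), (300.0, 1.9)]
slow_requests = [(149.5, 151.6), (150.2, 152.0), (299.9, 302.0), (420.0, 422.1)]

overlap = fraction_in_pauses(slow_requests, pauses)
```

In this sample three of the four slow requests overlap a pause. A high overlap supports the GC explanation. If the spikes persist with little or no overlap, CPU starvation or throttling becomes the more likely cause. If the pause timestamps do not include time-to-safepoint, the overlap will be understated.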