Latency vs. throughput
The difference between latency (response time for a single request) and throughput (requests per second capacity), why optimizing for one often hurts the other, and how to balance them for your workload.
TL;DR
| Scenario | Optimize for latency | Optimize for throughput |
|---|---|---|
| User-facing API | Response under 100ms at p99 | Not the priority; fast enough is fine |
| Batch data pipeline | Not the priority; overnight is fine | Maximize rows/second processed |
| Payment processing | Sub-200ms to confirm; users are staring | Moderate; not millions per second |
| IoT/telemetry ingestion | Seconds-old data is acceptable | Maximize events/second per node |
| Real-time bidding (RTB) | Under 10ms hard cutoff | High, but latency is the constraint |
Default instinct: optimize for latency first, then find throughput. Latency is harder to fix retroactively. You can always batch more to raise throughput, but you cannot un-batch to recover latency.
The Framing
You are on-call at 9 AM. Your API's p50 latency is 40ms (looks fine on the dashboard). But users are complaining about "slow pages." You check p99: 1,200ms. One in every hundred requests takes over a second, and because a single page load fans out to 8 backend calls, the probability that at least one call lands in the p99 tail is 1 - (0.99)^8 = 7.7%. Roughly one in thirteen page loads feels broken.
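The fan-out arithmetic is worth internalizing. A two-line sketch makes it concrete:

```python
# Probability that a request with N parallel backend calls experiences
# at least one call at or beyond the given percentile.
def p_at_least_one_outlier(fanout: int, percentile: float = 0.99) -> float:
    return 1 - percentile ** fanout

print(round(p_at_least_one_outlier(8), 3))   # fan-out of 8  -> 0.077
print(round(p_at_least_one_outlier(20), 3))  # fan-out of 20 -> 0.182
```

Note how quickly this grows: at a fan-out of 20, nearly one request in five touches the tail.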
This is the tension between latency and throughput. You can serve more total requests per second by batching, queuing, and amortizing overhead. But every optimization that amortizes work across requests adds waiting time to individual requests. The system processes more work per unit time, but each unit of work takes longer to complete.
I have seen teams spend weeks optimizing mean latency only to discover their p99 got worse. Mean latency is a vanity metric. Percentiles are where the pain lives.
Queueing theory predicts this: as server utilization approaches 100%, latency goes to infinity. For an M/M/1 queue, mean response time is response_time = service_time / (1 - utilization). At 50% utilization, average response time is 2x the bare service time. At 90%, it is 10x. At 99%, it is 100x. This is not linear. It is a hockey stick, and every production system sits somewhere on this curve.
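A quick sketch of that curve, assuming the M/M/1 formula above and an illustrative 10ms bare service time:

```python
# M/M/1 mean response time: W = S / (1 - rho),
# where S is the bare service time and rho the utilization.
def mean_response_time(service_time_ms: float, utilization: float) -> float:
    assert 0 <= utilization < 1, "utilization must be below 100%"
    return service_time_ms / (1 - utilization)

for rho in (0.5, 0.7, 0.9, 0.99):
    print(f"{rho:.0%} utilization -> {mean_response_time(10, rho):.0f}ms")
```

The jump between 90% and 99% utilization is the cliff that turns a routine traffic spike into an incident.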
The takeaway for your interview: capacity planning is latency planning. If you want low p99, keep utilization under 70%. If you want maximum throughput, you accept latency will spike.
The 70% utilization rule
Most production incidents I have investigated trace back to systems running above 80% utilization. At that point, any traffic spike (a retry storm, a cache flush, a deployment) pushes utilization past 90%, and latency explodes. Budget 30% headroom for spikes. If your steady-state is above 70%, you are one bad minute away from a page.
How Each Works
Latency: the time a single request waits
Latency measures the wall-clock time from when a request enters the system to when the response leaves. The key is to measure it at percentiles, not as a single average:
- p50 (median): half of requests are faster. Your "typical" user.
- p95: 5% of requests are slower. Your "slightly unlucky" user.
- p99: 1% of requests are slower. Your "worst normal-case" user.
- p99.9: one in a thousand. Dominated by GC pauses, cold caches, retries.
The gap between p50 and p99 is where systems break. A p50 of 40ms with a p99 of 1,200ms points to a bimodal distribution, often caused by cache misses, GC pauses, or lock contention on the slow path. I always ask teams: "What is your p99?" If they only know the mean, they do not know their latency.
Mean latency hides everything
A mean of 100ms could be 99% of requests at 10ms and 1% at 9,010ms. The mean looks fine. The experience is terrible. Always measure and alert on percentiles, never the mean.
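That exact distribution takes three lines to demonstrate:

```python
import statistics

# 99 requests at 10ms and one at 9,010ms: the mean looks healthy,
# while the worst 1% of users wait nine seconds.
latencies = [10.0] * 99 + [9010.0]

mean = statistics.mean(latencies)   # 100.0 ms -- "looks fine"
worst_1pct = max(latencies)         # 9010.0 ms -- the hidden tail
print(mean, worst_1pct)
```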
Throughput: the work the system completes per unit time
Throughput measures operations per second, transactions per second, bytes per second, or messages per second. It answers: how much total work can the system do?
Throughput improves when you reduce per-request overhead. Batching, compression, connection pooling, and multiplexing all improve throughput by amortizing fixed costs across multiple operations.
Example: Database writes
- Individual inserts: 1,000 writes/sec (1ms per write, a TCP round-trip each time)
- Batched inserts (100 per batch): 5,000 writes/sec (20ms per batch, amortized)
Throughput improved 5x.
But individual write latency went from 1ms to up to 20ms (waiting for batch).
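A minimal sketch of the batched pattern, using an in-memory SQLite table as a stand-in for a real database (the table, column names, and batch size are illustrative):

```python
import sqlite3

# Amortize commit overhead by grouping rows into batches:
# one commit (and one fsync on a real database) per batch, not per row.
def insert_batched(conn, rows, batch_size=100):
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)",
            rows[i:i + batch_size])
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
insert_batched(conn, [(i, "event") for i in range(250)])
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 250
```

The latency cost is visible in the structure: row 1 of each batch is not durable until rows 2 through 100 have arrived.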
Little's Law: the bridge between them
Little's Law connects latency, throughput, and concurrency in one equation:
L = lambda * W
- L = average number of requests in the system (concurrency)
- lambda = throughput (requests/second)
- W = average latency (seconds/request)
Rearranged: throughput = concurrency / latency. This means halving latency (at the same concurrency) doubles throughput. Doubling concurrency (at the same latency) also doubles throughput. But adding concurrency without reducing latency eventually hits diminishing returns because queueing kicks in.
Little's Law is powerful because it tells you where your bottleneck is. If throughput is low and latency is low, you are concurrency-starved (add threads, connections, or processes). If throughput is low and latency is high, you have a latency bottleneck (find and fix the slow path). If concurrency is high and throughput is flat, you are saturated (scale out or optimize).
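As a back-of-the-envelope tool, the rearranged form is easy to script (the pool sizes and latencies below are illustrative):

```python
# Little's Law rearranged: throughput = concurrency / latency.
def max_throughput(concurrency: int, latency_s: float) -> float:
    return concurrency / latency_s

# A pool of 50 connections with 25ms average latency per request
# caps out near 2,000 requests/sec; halving latency doubles the cap.
print(max_throughput(50, 0.025))   # ~2000 req/s
print(max_throughput(50, 0.0125))  # ~4000 req/s
```

This is why "add more connections" and "make each call faster" are interchangeable levers right up until queueing makes the latency term blow up.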
Batching: trading latency for throughput
Batching is the most common throughput optimization, and it always costs latency. The first item in a batch waits for the batch to fill before processing begins.
Kafka producers batch opportunistically out of the box (batch.size=16384; raising linger.ms above its default of 0 adds a deliberate wait so batches can fill). PostgreSQL with synchronous_commit=off acknowledges commits before the WAL flush, letting one flush cover many transactions. gRPC streaming amortizes connection and framing overhead across many messages. Every one of these improves throughput and increases individual request latency.
My rule of thumb: if the user is not waiting on the result (analytics events, logs, metrics), batch aggressively. If the user is staring at a spinner, batch minimally or not at all.
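A micro-batcher that caps both batch size and wait time captures the trade-off in one place. This is a sketch, not any particular library's API; a production version would flush on a background timer rather than only inside add():

```python
import time

class MicroBatcher:
    """Flush when the batch is full OR the oldest item has waited too long."""

    def __init__(self, flush_fn, max_size=100, max_wait_s=0.05):
        self.flush_fn = flush_fn      # called with each completed batch
        self.max_size = max_size      # throughput lever: bigger batches
        self.max_wait_s = max_wait_s  # latency lever: bounded waiting
        self.buffer = []
        self.oldest = None

    def add(self, item):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(item)
        if (len(self.buffer) >= self.max_size
                or time.monotonic() - self.oldest >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
b = MicroBatcher(batches.append, max_size=3, max_wait_s=60)
for i in range(7):
    b.add(i)
b.flush()        # drain the partial tail batch
print(batches)   # [[0, 1, 2], [3, 4, 5], [6]]
```

max_size and max_wait_s are exactly Kafka's batch.size and linger.ms knobs in miniature: the first bounds how much you amortize, the second bounds how long any item waits.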
Head-to-Head Comparison
| Dimension | Optimize for latency | Optimize for throughput | Tension |
|---|---|---|---|
| Batching | Avoid; process immediately | Batch aggressively, amortize overhead | Batch size vs. wait time |
| Compression | Skip if CPU-bound; raw bytes faster | Compress to fit more data per I/O cycle | CPU cost vs. I/O savings |
| Connection pooling | Small pool, less queueing | Large pool, more concurrent work | Queue depth vs. idle connections |
| Thread count | Fewer threads, less context switching | More threads, saturate CPU/I/O | Context switch cost vs. idle CPU |
| Caching | Reduces read path time | Reduces backend load for aggregate throughput | Cold cache startup, stampede |
| Replication | Read replicas reduce read latency | More replicas handle more total reads | Replication lag consistency |
| Network calls | Minimize hops, co-locate services | Parallelize calls, fan-out/fan-in | Depth of chain vs. fan-out |
| Serialization | Compact binary (Protobuf) | Batch serialization, pre-serialize | CPU cost vs. bandwidth |
| Timeouts | Aggressive, fail fast | Generous, allow slow completions | False failures vs. head-of-line blocking |
The fundamental tension: amortizing overhead improves throughput but adds waiting time to individual requests. Every throughput optimization is, at its core, asking individual requests to wait so the system as a whole can do more work.
Interview shortcut: name the tradeoff immediately
When you propose batching, compression, or connection pooling, follow it immediately with: "This improves throughput at the cost of individual request latency. For this use case, that trade-off is acceptable because..." Naming the tradeoff before the interviewer asks about it is a senior-level signal.
When Latency Wins
Choose latency optimization when the user is directly waiting on the result. These workloads have a human at the other end of the request, and every millisecond matters.
User-facing APIs. E-commerce checkout, search results, feed loading. Google found that adding 500ms to search latency dropped traffic 20%. Amazon found that every 100ms of latency cost 1% of revenue. These numbers apply to your product too.
Real-time bidding. Ad exchanges enforce a 10ms hard cutoff. If your bid response arrives in 11ms, it is discarded. There is no "retry" or "eventually consistent." You either respond in time or you lose the auction.
Payment confirmation. The user is staring at a screen waiting to know if their purchase went through. Sub-200ms response time is the expectation. Batching payment confirmations to improve throughput would be absurd.
Multiplayer gaming. Player actions must be reflected within one frame (16ms at 60fps). Input latency above 100ms is perceptible and creates a bad experience. Throughput of the game server matters, but latency is the hard constraint.
When Throughput Wins
Choose throughput optimization when the volume of work matters more than any individual item's speed. These workloads process data in bulk with no human waiting on any single result.
Batch data pipelines. ETL jobs, nightly aggregations, data warehouse loading. Processing 10TB of data by 6 AM is the constraint. Whether any individual record takes 1ms or 100ms is irrelevant. What matters is total records per second.
Log and telemetry ingestion. Your observability pipeline processes 500K events per second. Each event does not need sub-millisecond processing. Batching events into Kafka at linger.ms=50 increases throughput 3-5x with negligible impact on log freshness.
Offline ML training. Training a recommendation model on 1 billion interactions. The job runs for hours. Throughput (training samples per second) directly determines how quickly you can iterate.
File and media processing. Video transcoding, image resizing, PDF generation. The user uploaded the file and left. Processing it in 5 minutes versus 10 minutes matters for cost. Processing each frame 1ms faster does not.
The Nuance
Here is the honest answer: latency and throughput are not always in opposition.
Caching improves both. A cache hit is faster (lower latency) and reduces backend load (higher aggregate throughput). This is why caching is the first optimization you reach for in almost every system.
Reducing serialization overhead improves both. Switching from JSON to Protobuf shrinks payloads (less network time) and reduces CPU spent parsing (faster per-request). Throughput goes up, latency goes down.
The real conflict appears when you are at the frontier. Once you have done the "free wins" (caching, efficient serialization, connection reuse), every further improvement in one dimension costs the other. Batching trades latency for throughput. Hedged requests trade throughput for latency (they consume extra capacity). Compression trades CPU latency for network throughput.
The pattern I recommend: start by optimizing for latency on the user-facing path. Once latency is acceptable, look at throughput for background and batch workloads. Never sacrifice user-facing latency for throughput unless you have exhausted every other option.
For your interview: when the interviewer says "how would you improve performance?" always ask "latency or throughput?" They are different optimizations with different techniques. Conflating them is a red flag.
Real-World Examples
Google (hedged requests for latency). Dean and Barroso's "The Tail at Scale" paper describes hedged requests: send the same read to a second replica if the first has not answered within a short delay, and take whichever responds first. In a Google benchmark reading 1,000 keys from Bigtable, hedging after 10ms cut p99.9 latency from roughly 1,800ms to 74ms while adding only about 2% read load. Google found that 2% a bargain compared to the latency improvement for user-facing services like Search and Gmail.
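The simplest form of hedging, duplicating the read to two replicas and taking the first answer, can be sketched with a thread pool. The replica names and simulated latencies here are illustrative:

```python
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def hedged_read(fetch_from, replicas, key):
    """Issue the same read to every replica; return the first result."""
    futures = [pool.submit(fetch_from, r, key) for r in replicas]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    return next(iter(done)).result()

def fetch_from(replica, key):
    # Simulated replica latencies: one healthy node, one slow node.
    time.sleep(0.2 if replica == "slow" else 0.01)
    return f"{key}@{replica}"

print(hedged_read(fetch_from, ["slow", "fast"], "user:42"))  # user:42@fast
```

Duplicating every read doubles load; production hedging instead delays the second request until the first has exceeded roughly its p95 latency, which keeps the extra load to a few percent.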
Kafka (batching for throughput). Kafka's producer batches messages by default. With linger.ms=0, Kafka sends each message immediately (lowest latency). With linger.ms=50 and batch.size=65536, the producer accumulates messages for up to 50ms before sending a batch. LinkedIn found this improved throughput from 50K messages/sec to 200K messages/sec per producer, with median latency increasing from 2ms to ~25ms. For their analytics pipeline, 25ms latency is invisible. For their real-time notification system, they run a separate producer with linger.ms=0.
Discord (tail latency mitigation). Discord serves messages to 150M+ monthly users across thousands of voice and text channels. Their read path fans out to multiple data stores (Cassandra for messages, Redis for sessions, PostgreSQL for metadata). They use request deadlines: every request carries a remaining-time budget. If a downstream call would exceed the budget, it is skipped and a degraded response is returned. This prevents a single slow database query from turning a 100ms API call into a 5-second timeout. The result: Discord maintains p99 under 100ms even during traffic spikes, because the system gracefully degrades rather than cascading failures through the fan-out.
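The deadline-budget idea fits in a few lines. The names Deadline and call_or_degrade are illustrative, not Discord's actual code:

```python
import time

class Deadline:
    """A remaining-time budget carried along with a request."""

    def __init__(self, budget_s):
        self.expires = time.monotonic() + budget_s

    def remaining(self):
        return self.expires - time.monotonic()

def call_or_degrade(deadline, expected_cost_s, call, fallback):
    # Skip a downstream call that would blow the budget and
    # return a degraded response instead of cascading the delay.
    if deadline.remaining() < expected_cost_s:
        return fallback
    return call()

d = Deadline(budget_s=0.100)
result = call_or_degrade(d, expected_cost_s=0.500,
                         call=lambda: "fresh metadata",
                         fallback="cached metadata")
print(result)  # cached metadata
```

The key property is that the budget shrinks as it propagates: a downstream service sees only the time the upstream caller has left, so no single slow dependency can consume the whole request.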
How This Shows Up in Interviews
This trade-off appears in every system design interview, usually implicitly. When you propose "add a cache" or "batch writes," the interviewer may push on the latency implications. Showing you understand the trade-off elevates your answer.
What they are testing: Can you reason about performance in two distinct dimensions? Do you know that optimizing mean throughput can degrade tail latency? Can you apply Little's Law to reason about capacity?
Depth expected at senior level:
- Know percentile definitions (p50, p95, p99) and why mean is misleading
- Explain Little's Law in plain English and use it to calculate capacity
- Name the batching/latency trade-off in at least one concrete system (Kafka, PostgreSQL)
- Understand tail latency amplification in fan-out architectures
- Know when hedged requests are appropriate and their cost
| Interviewer asks | Strong answer |
|---|---|
| "How would you improve performance here?" | "First, clarify: latency or throughput? For user-facing latency, I would add caching and reduce fan-out depth. For backend throughput, I would batch writes and increase parallelism." |
| "Why not batch everything?" | "Batching amortizes overhead, so throughput improves. But the first item in each batch waits for the batch to fill. For user-facing paths, that waiting time is unacceptable. I would batch only async/background paths." |
| "What happens at high utilization?" | "Queueing theory predicts exponential latency growth. At 70% utilization, p99 is roughly 5x baseline. At 90%, it is 10x. I would set a capacity target of 70% utilization and auto-scale before hitting it." |
| "How do you handle tail latency in a fan-out architecture?" | "Tail latency amplifies with fan-out width. For 20 parallel calls with p99=50ms, 18% of requests hit at least one outlier. I would use hedged requests for the slowest 5% and set aggressive timeouts on all downstream calls." |
Interview tip: always specify the percentile
Never say "latency is 50ms" without specifying which percentile. Say "p50 is 12ms and p99 is 180ms." This signals you understand that averages hide tail behavior. If the interviewer only mentions "average latency," gently redirect to percentiles.
Quick Recap
- Latency measures the time for a single request; throughput measures total work per second. They are related through Little's Law (L = lambda * W), but optimizing one often hurts the other.
- Batching is the most common throughput optimization and the most common source of latency degradation. Every batched system trades individual request speed for aggregate efficiency.
- Queueing theory predicts hockey-stick latency growth as utilization approaches 100%. Keep utilization under 70% for predictable p99 latency.
- Tail latency amplifies with fan-out: for N parallel calls, the probability of hitting a p99 outlier is 1 - (0.99)^N. At 20 services, that is 18% of requests.
- Hedged requests mitigate tail latency at the cost of additional load. Google's approach: hedge only the slowest requests, with delayed sending, to limit overhead to roughly 2-5%.
- In interviews, always ask "latency or throughput?" when performance is discussed. They require different techniques, and conflating them is a signal of shallow understanding.
Related Trade-offs
- Sync vs. async communication for how async messaging trades latency for throughput and decoupling
- Horizontal vs. vertical scaling for how adding nodes affects throughput ceiling and tail latency
- Batch vs. stream processing for the data pipeline version of this trade-off
- Caching for the one optimization that improves both latency and throughput simultaneously
- Load balancing for distributing work to keep individual node utilization under the latency cliff