Latency vs. throughput
The difference between latency (response time for a single request) and throughput (requests per second capacity), why optimizing for one often hurts the other, and how to balance them for your workload.
TL;DR
| Scenario | Optimize for latency | Optimize for throughput |
|---|---|---|
| User-facing API | Response under 100ms at p99 | Not the priority; fast enough is fine |
| Batch data pipeline | Not the priority; overnight is fine | Maximize rows/second processed |
| Payment processing | Sub-200ms to confirm; users are staring | Moderate; not millions per second |
| IoT/telemetry ingestion | Seconds-old data is acceptable | Maximize events/second per node |
| Real-time bidding (RTB) | Under 10ms hard cutoff | High, but latency is the constraint |
Default instinct: optimize for latency first, then find throughput. Latency is harder to fix retroactively. You can always batch more to raise throughput, but you cannot un-batch to recover latency.
The Framing
You are on-call at 9 AM. Your API's p50 latency is 40ms (looks fine on the dashboard). But users are complaining about "slow pages." You check p99: 1,200ms. One in every hundred requests takes over a second, and because a single page load fans out to 8 backend calls, the probability that at least one call lands in the p99 tail is 1 - (0.99)^8 = 7.7%. Roughly one in thirteen page loads feels broken.
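The fan-out arithmetic is worth internalizing. A two-line sketch makes it concrete:

```python
# Probability that a request with N parallel backend calls experiences
# at least one call at or beyond the given percentile.
def p_at_least_one_outlier(fanout: int, percentile: float = 0.99) -> float:
    return 1 - percentile ** fanout

print(round(p_at_least_one_outlier(8), 3))   # fan-out of 8  -> 0.077
print(round(p_at_least_one_outlier(20), 3))  # fan-out of 20 -> 0.182
```

Note how quickly this grows: at a fan-out of 20, nearly one request in five touches the tail.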
This is the tension between latency and throughput. You can serve more total requests per second by batching, queuing, and amortizing overhead. But every optimization that amortizes work across requests adds waiting time to individual requests. The system processes more work per unit time, but each unit of work takes longer to complete.
I have seen teams spend weeks optimizing mean latency only to discover their p99 got worse. Mean latency is a vanity metric. Percentiles are where the pain lives.
Queueing theory predicts this: as server utilization approaches 100%, latency goes to infinity. For an M/M/1 queue, mean response time is response_time = service_time / (1 - utilization). At 50% utilization, average response time is 2x the bare service time. At 90%, it is 10x. At 99%, it is 100x. This is not linear. It is a hockey stick, and every production system sits somewhere on this curve.
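A quick sketch of that curve, assuming the M/M/1 formula above and an illustrative 10ms bare service time:

```python
# M/M/1 mean response time: W = S / (1 - rho),
# where S is the bare service time and rho the utilization.
def mean_response_time(service_time_ms: float, utilization: float) -> float:
    assert 0 <= utilization < 1, "utilization must be below 100%"
    return service_time_ms / (1 - utilization)

for rho in (0.5, 0.7, 0.9, 0.99):
    print(f"{rho:.0%} utilization -> {mean_response_time(10, rho):.0f}ms")
```

The jump between 90% and 99% utilization is the cliff that turns a routine traffic spike into an incident.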
The takeaway for your interview: capacity planning is latency planning. If you want low p99, keep utilization under 70%. If you want maximum throughput, you accept latency will spike.
The 70% utilization rule
Most production incidents I have investigated trace back to systems running above 80% utilization. At that point, any traffic spike (a retry storm, a cache flush, a deployment) pushes utilization past 90%, and latency explodes. Budget 30% headroom for spikes. If your steady-state is above 70%, you are one bad minute away from a page.
How Each Works
Latency: the time a single request waits
Latency measures the wall-clock time from when a request enters the system to when the response leaves. The key is to measure it at percentiles, not as a single average:
- p50 (median): half of requests are faster. Your "typical" user.
- p95: 5% of requests are slower. Your "slightly unlucky" user.
- p99: 1% of requests are slower. Your "worst normal-case" user.
- p99.9: one in a thousand. Dominated by GC pauses, cold caches, retries.
The gap between p50 and p99 is where systems break. A p50 of 40ms with a p99 of 1,200ms points to a bimodal distribution, often caused by cache misses, GC pauses, or lock contention on the slow path. I always ask teams: "What is your p99?" If they only know the mean, they do not know their latency.
Mean latency hides everything
A mean of 100ms could be 99% of requests at 10ms and 1% at 9,010ms. The mean looks fine. The experience is terrible. Always measure and alert on percentiles, never the mean.
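That exact distribution takes three lines to demonstrate:

```python
import statistics

# 99 requests at 10ms and one at 9,010ms: the mean looks healthy,
# while the worst 1% of users wait nine seconds.
latencies = [10.0] * 99 + [9010.0]

mean = statistics.mean(latencies)   # 100.0 ms -- "looks fine"
worst_1pct = max(latencies)         # 9010.0 ms -- the hidden tail
print(mean, worst_1pct)
```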
Throughput: the work the system completes per unit time
Throughput measures operations per second, transactions per second, bytes per second, or messages per second. It answers: how much total work can the system do?
Throughput improves when you reduce per-request overhead. Batching, compression, connection pooling, and multiplexing all improve throughput by amortizing fixed costs across multiple operations.
Example: Database writes
- Individual inserts: 1,000 writes/sec (1ms per write, a TCP round-trip each time)
- Batched inserts (100 per batch): 5,000 writes/sec (20ms per batch, amortized)
Throughput improved 5x.
But individual write latency went from 1ms to up to 20ms (waiting for batch).
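A minimal sketch of the batched pattern, using an in-memory SQLite table as a stand-in for a real database (the table, column names, and batch size are illustrative):

```python
import sqlite3

# Amortize commit overhead by grouping rows into batches:
# one commit (and one fsync on a real database) per batch, not per row.
def insert_batched(conn, rows, batch_size=100):
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)",
            rows[i:i + batch_size])
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
insert_batched(conn, [(i, "event") for i in range(250)])
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 250
```

The latency cost is visible in the structure: row 1 of each batch is not durable until rows 2 through 100 have arrived.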
Little's Law: the bridge between them
Little's Law connects latency, throughput, and concurrency in one equation:
L = lambda * W
- L = average number of requests in the system (concurrency)
- lambda = throughput (requests/second)
- W = average latency (seconds/request)
Rearranged: throughput = concurrency / latency. This means halving latency (at the same concurrency) doubles throughput. Doubling concurrency (at the same latency) also doubles throughput. But adding concurrency without reducing latency eventually hits diminishing returns because queueing kicks in.
Little's Law is powerful because it tells you where your bottleneck is. If throughput is low and latency is low, you are concurrency-starved (add threads, connections, or processes). If throughput is low and latency is high, you have a latency bottleneck (find and fix the slow path). If concurrency is high and throughput is flat, you are saturated (scale out or optimize).
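As a back-of-the-envelope tool, the rearranged form is easy to script (the pool sizes and latencies below are illustrative):

```python
# Little's Law rearranged: throughput = concurrency / latency.
def max_throughput(concurrency: int, latency_s: float) -> float:
    return concurrency / latency_s

# A pool of 50 connections with 25ms average latency per request
# caps out near 2,000 requests/sec; halving latency doubles the cap.
print(max_throughput(50, 0.025))   # ~2000 req/s
print(max_throughput(50, 0.0125))  # ~4000 req/s
```

This is why "add more connections" and "make each call faster" are interchangeable levers right up until queueing makes the latency term blow up.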
Batching: trading latency for throughput
Batching is the most common throughput optimization, and it always costs latency. The first item in a batch waits for the batch to fill before processing begins.
Kafka producers batch opportunistically out of the box (batch.size=16384; raising linger.ms above its default of 0 adds a deliberate wait so batches can fill). PostgreSQL with synchronous_commit=off acknowledges commits before the WAL flush, letting one flush cover many transactions. gRPC streaming amortizes connection and framing overhead across many messages. Every one of these improves throughput and increases individual request latency.
My rule of thumb: if the user is not waiting on the result (analytics events, logs, metrics), batch aggressively. If the user is staring at a spinner, batch minimally or not at all.
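A micro-batcher that caps both batch size and wait time captures the trade-off in one place. This is a sketch, not any particular library's API; a production version would flush on a background timer rather than only inside add():

```python
import time

class MicroBatcher:
    """Flush when the batch is full OR the oldest item has waited too long."""

    def __init__(self, flush_fn, max_size=100, max_wait_s=0.05):
        self.flush_fn = flush_fn      # called with each completed batch
        self.max_size = max_size      # throughput lever: bigger batches
        self.max_wait_s = max_wait_s  # latency lever: bounded waiting
        self.buffer = []
        self.oldest = None

    def add(self, item):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(item)
        if (len(self.buffer) >= self.max_size
                or time.monotonic() - self.oldest >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
b = MicroBatcher(batches.append, max_size=3, max_wait_s=60)
for i in range(7):
    b.add(i)
b.flush()        # drain the partial tail batch
print(batches)   # [[0, 1, 2], [3, 4, 5], [6]]
```

max_size and max_wait_s are exactly Kafka's batch.size and linger.ms knobs in miniature: the first bounds how much you amortize, the second bounds how long any item waits.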
Head-to-Head Comparison
| Dimension | Optimize for latency | Optimize for throughput | Tension |
|---|---|---|---|
| Batching | Avoid; process immediately | Batch aggressively, amortize overhead | Batch size vs. wait time |
| Compression | Skip if CPU-bound; raw bytes faster | Compress to fit more data per I/O cycle | CPU cost vs. I/O savings |
| Connection pooling | Small pool, less queueing | Large pool, more concurrent work | Queue depth vs. idle connections |
| Thread count | Fewer threads, less context switching | More threads, saturate CPU/I/O | Context switch cost vs. idle CPU |
| Caching | Reduces read path time | Reduces backend load for aggregate throughput | Cold cache startup, stampede |
| Replication | Read replicas reduce read latency | More replicas handle more total reads | Replication lag consistency |
| Network calls | Minimize hops, co-locate services | Parallelize calls, fan-out/fan-in | Depth of chain vs. fan-out |
| Serialization | Compact binary (Protobuf) | Batch serialization, pre-serialize | CPU cost vs. bandwidth |
| Timeouts | Aggressive, fail fast | Generous, allow slow completions | False failures vs. head-of-line blocking |
The fundamental tension: amortizing overhead improves throughput but adds waiting time to individual requests. Every throughput optimization is, at its core, asking individual requests to wait so the system as a whole can do more work.
Interview shortcut: name the tradeoff immediately
When you propose batching, compression, or connection pooling, follow it immediately with: "This improves throughput at the cost of individual request latency. For this use case, that trade-off is acceptable because..." Naming the tradeoff before the interviewer asks about it is a senior-level signal.
When Latency Wins
Choose latency optimization when the user is directly waiting on the result. These workloads have a human at the other end of the request, and every millisecond matters.
User-facing APIs. E-commerce checkout, search results, feed loading. Google found that adding 500ms to search latency dropped traffic 20%. Amazon found that every 100ms of latency cost 1% of revenue. These numbers apply to your product too.
Real-time bidding. Ad exchanges enforce a 10ms hard cutoff. If your bid response arrives in 11ms, it is discarded. There is no "retry" or "eventually consistent." You either respond in time or you lose the auction.
Payment confirmation. The user is staring at a screen waiting to know if their purchase went through. Sub-200ms response time is the expectation. Batching payment confirmations to improve throughput would be absurd.
Multiplayer gaming. Player actions must be reflected within one frame (16ms at 60fps). Input latency above 100ms is perceptible and creates a bad experience. Throughput of the game server matters, but latency is the hard constraint.
When Throughput Wins
Choose throughput optimization when the volume of work matters more than any individual item's speed. These workloads process data in bulk with no human waiting on any single result.
Batch data pipelines. ETL jobs, nightly aggregations, data warehouse loading. Processing 10TB of data by 6 AM is the constraint. Whether any individual record takes 1ms or 100ms is irrelevant. What matters is total records per second.
Log and telemetry ingestion. Your observability pipeline processes 500K events per second. Each event does not need sub-millisecond processing. Batching events into Kafka at linger.ms=50 increases throughput 3-5x with negligible impact on log freshness.
Offline ML training. Training a recommendation model on 1 billion interactions. The job runs for hours. Throughput (training samples per second) directly determines how quickly you can iterate.
File and media processing. Video transcoding, image resizing, PDF generation. The user uploaded the file and left. Processing it in 5 minutes versus 10 minutes matters for cost. Processing each frame 1ms faster does not.
The Nuance
Here is the honest answer: latency and throughput are not always in opposition.
Caching improves both. A cache hit is faster (lower latency) and reduces backend load (higher aggregate throughput). This is why caching is the first optimization you reach for in almost every system.
Reducing serialization overhead improves both. Switching from JSON to Protobuf shrinks payloads (less network time) and reduces CPU spent parsing (faster per-request). Throughput goes up, latency goes down.
The real conflict appears when you are at the frontier. Once you have done the "free wins" (caching, efficient serialization, connection reuse), every further improvement in one dimension costs the other. Batching trades latency for throughput. Hedged requests trade throughput for latency (they consume extra capacity). Compression trades CPU latency for network throughput.
The pattern I recommend: start by optimizing for latency on the user-facing path. Once latency is acceptable, look at throughput for background and batch workloads. Never sacrifice user-facing latency for throughput unless you have exhausted every other option.
For your interview: when the interviewer says "how would you improve performance?" always ask "latency or throughput?" They are different optimizations with different techniques. Conflating them is a red flag.
Real-World Examples
Google (hedged requests for latency). Dean and Barroso's "The Tail at Scale" paper describes hedged requests: send the same read to a second replica if the first has not answered within a short delay, and take whichever responds first. In a Google benchmark reading 1,000 keys from Bigtable, hedging after 10ms cut p99.9 latency from roughly 1,800ms to 74ms while adding only about 2% read load. Google found that 2% a bargain compared to the latency improvement for user-facing services like Search and Gmail.
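The simplest form of hedging, duplicating the read to two replicas and taking the first answer, can be sketched with a thread pool. The replica names and simulated latencies here are illustrative:

```python
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def hedged_read(fetch_from, replicas, key):
    """Issue the same read to every replica; return the first result."""
    futures = [pool.submit(fetch_from, r, key) for r in replicas]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    return next(iter(done)).result()

def fetch_from(replica, key):
    # Simulated replica latencies: one healthy node, one slow node.
    time.sleep(0.2 if replica == "slow" else 0.01)
    return f"{key}@{replica}"

print(hedged_read(fetch_from, ["slow", "fast"], "user:42"))  # user:42@fast
```

Duplicating every read doubles load; production hedging instead delays the second request until the first has exceeded roughly its p95 latency, which keeps the extra load to a few percent.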
Kafka (batching for throughput). Kafka's producer batches messages by default. With linger.ms=0, Kafka sends each message immediately (lowest latency). With linger.ms=50 and batch.size=65536, the producer accumulates messages for up to 50ms before sending a batch. LinkedIn found this improved throughput from 50K messages/sec to 200K messages/sec per producer, with median latency increasing from 2ms to ~25ms. For their analytics pipeline, 25ms latency is invisible. For their real-time notification system, they run a separate producer with linger.ms=0.
Discord (tail latency mitigation). Discord serves messages to 150M+ monthly users across thousands of voice and text channels. Their read path fans out to multiple data stores (Cassandra for messages, Redis for sessions, PostgreSQL for metadata). They use request deadlines: every request carries a remaining-time budget. If a downstream call would exceed the budget, it is skipped and a degraded response is returned. This prevents a single slow database query from turning a 100ms API call into a 5-second timeout. The result: Discord maintains p99 under 100ms even during traffic spikes, because the system gracefully degrades rather than cascading failures through the fan-out.
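The deadline-budget idea fits in a few lines. The names Deadline and call_or_degrade are illustrative, not Discord's actual code:

```python
import time

class Deadline:
    """A remaining-time budget carried along with a request."""

    def __init__(self, budget_s):
        self.expires = time.monotonic() + budget_s

    def remaining(self):
        return self.expires - time.monotonic()

def call_or_degrade(deadline, expected_cost_s, call, fallback):
    # Skip a downstream call that would blow the budget and
    # return a degraded response instead of cascading the delay.
    if deadline.remaining() < expected_cost_s:
        return fallback
    return call()

d = Deadline(budget_s=0.100)
result = call_or_degrade(d, expected_cost_s=0.500,
                         call=lambda: "fresh metadata",
                         fallback="cached metadata")
print(result)  # cached metadata
```

The key property is that the budget shrinks as it propagates: a downstream service sees only the time the upstream caller has left, so no single slow dependency can consume the whole request.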
How This Shows Up in Interviews
This trade-off appears in every system design interview, usually implicitly. When you propose "add a cache" or "batch writes," the interviewer may push on the latency implications. Showing you understand the trade-off elevates your answer.
What they are testing: Can you reason about performance in two distinct dimensions? Do you know that optimizing mean throughput can degrade tail latency? Can you apply Little's Law to reason about capacity?
Depth expected at senior level:
- Know percentile definitions (p50, p95, p99) and why mean is misleading
- Explain Little's Law in plain English and use it to calculate capacity
- Name the batching/latency trade-off in at least one concrete system (Kafka, PostgreSQL)
- Understand tail latency amplification in fan-out architectures
- Know when hedged requests are appropriate and their cost
| Interviewer asks | Strong answer |
|---|---|
| "How would you improve performance here?" | "First, clarify: latency or throughput? For user-facing latency, I would add caching and reduce fan-out depth. For backend throughput, I would batch writes and increase parallelism." |
| "Why not batch everything?" | "Batching amortizes overhead, so throughput improves. But the first item in each batch waits for the batch to fill. For user-facing paths, that waiting time is unacceptable. I would batch only async/background paths." |
| "What happens at high utilization?" | "Queueing theory predicts exponential latency growth. At 70% utilization, p99 is roughly 5x baseline. At 90%, it is 10x. I would set a capacity target of 70% utilization and auto-scale before hitting it." |
| "How do you handle tail latency in a fan-out architecture?" | "Tail latency amplifies with fan-out width. For 20 parallel calls with p99=50ms, 18% of requests hit at least one outlier. I would use hedged requests for the slowest 5% and set aggressive timeouts on all downstream calls." |
Interview tip: always specify the percentile
Never say "latency is 50ms" without specifying which percentile. Say "p50 is 12ms and p99 is 180ms." This signals you understand that averages hide tail behavior. If the interviewer only mentions "average latency," gently redirect to percentiles.
Quick Recap
- Latency measures the time for a single request; throughput measures total work per second. They are related through Little's Law (L = lambda * W), but optimizing one often hurts the other.
- Batching is the most common throughput optimization and the most common source of latency degradation. Every batched system trades individual request speed for aggregate efficiency.
- Queueing theory predicts hockey-stick latency growth as utilization approaches 100%. Keep utilization under 70% for predictable p99 latency.
- Tail latency amplifies with fan-out: for N parallel calls, the probability of hitting a p99 outlier is 1 - (0.99)^N. At 20 services, that is 18% of requests.
- Hedged requests mitigate tail latency at the cost of additional load. Google's approach: hedge only the slowest requests, with delayed sending, to limit overhead to roughly 2-5%.
- In interviews, always ask "latency or throughput?" when performance is discussed. They require different techniques, and conflating them is a signal of shallow understanding.
Related Trade-offs
- Sync vs. async communication for how async messaging trades latency for throughput and decoupling
- Horizontal vs. vertical scaling for how adding nodes affects throughput ceiling and tail latency
- Batch vs. stream processing for the data pipeline version of this trade-off
- Caching for the one optimization that improves both latency and throughput simultaneously
- Load balancing for distributing work to keep individual node utilization under the latency cliff