Capacity planning in system design
Translate estimates into infrastructure decisions: when to add a cache, when to shard, when to go multi-region, and how to present it in an interview.
TL;DR
- Capacity planning is the bridge between estimation and architecture. You've computed the numbers, now you use them to pick components and decide scaling strategies.
- Every system hits scaling inflection points: traffic thresholds where your current approach stops working and you need the next technique. Knowing these thresholds is the difference between proactive design and reactive firefighting.
- The scaling ladder is: single server → separate DB → caching → horizontal app scaling → sharding → multi-region. You climb one rung at a time, not five.
- For your interview: state the current traffic, name the bottleneck, pick the solution, and explain what traffic level would force the next step. That loop is the entire capacity planning skill.
- Over-provisioning is as much a failure as under-provisioning. The goal isn't maximum scale. It's the right scale for the right cost.
The Problem: Numbers Without Decisions
You've done the estimation. You know you'll have 50K reads/sec and 500 writes/sec. Your storage grows at 2TB per year. Great.
Now what?
This is where most candidates stall. They have the numbers on the whiteboard but don't connect them to infrastructure choices. The interviewer waits. The candidate says something vague like "we'd need to scale the database." Scale how? More memory? Read replicas? Sharding? A different database entirely?
Capacity planning is the discipline of translating estimates into decisions. It answers: given these traffic and storage numbers, what components do I need, how many instances of each, and what breaks first as we grow?
I've watched senior candidates at Google and Meta lose points not because their estimates were wrong, but because they stopped after computing the numbers. The estimation is the setup. Capacity planning is the punchline.
For your interview: every number you write on the board should have an arrow pointing to a design decision. If an estimate doesn't change your architecture, delete it and spend that time elsewhere.
The 'just add more servers' fallacy
"We'll horizontally scale" is not a capacity plan. Which tier scales? At what threshold? What's the bottleneck that prevents linear scaling? If your database is the bottleneck, adding 100 app servers gives you 100 app servers queuing on one database. Capacity planning is about identifying which component hits its ceiling first and solving that specific bottleneck.
The Scaling Ladder
Every system follows the same progression as it grows. Each rung solves a specific bottleneck and introduces the next one.
Tier 1: Single Server (0 - 1K req/sec)
Setup: One machine running everything: web server, application, database.
Bottleneck: CPU and memory are shared between all processes. Database I/O competes with application processing.
When to move on: When the single server hits 70-80% CPU under normal load, or when you need deployments without downtime.
What most people get wrong: they skip this tier entirely in interviews. But for an early-stage product with 1K users, a single $100/month server running Docker Compose is the right answer. If you open a design for a startup MVP with a load balancer, Redis cluster, and Kafka, you've lost the plot.
Tier 2: Separate Database (1K - 10K req/sec)
Setup: Application server and database on separate machines. Possibly a load balancer in front of 2-3 app server instances.
Bottleneck: Database becomes the constraint. All reads and writes hit one instance.
Key decision: Is the bottleneck reads or writes?
- Read-heavy (most systems): Add a cache (Tier 3) or read replicas
- Write-heavy: Optimize your schema, add indexes, or start planning for sharding (Tier 5)
Capacity math: A single PostgreSQL instance handles ~10K simple reads/sec and ~1-5K writes/sec. If your traffic is below these numbers, you don't need anything else for the database tier.
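That read-or-write decision is mechanical enough to script. A minimal sketch, using the ballpark single-instance ceilings quoted above (illustrative figures, not guarantees for any specific hardware):

```python
# Rough single-instance PostgreSQL ceilings from the text above.
# Tune these for your actual hardware and query mix.
SINGLE_DB_READ_CEILING = 10_000   # simple reads/sec
SINGLE_DB_WRITE_CEILING = 5_000   # writes/sec

def tier2_decision(reads_per_sec: float, writes_per_sec: float) -> str:
    """Return the next scaling move for a separate-DB setup."""
    if reads_per_sec <= SINGLE_DB_READ_CEILING and writes_per_sec <= SINGLE_DB_WRITE_CEILING:
        return "single instance is enough"
    if reads_per_sec > SINGLE_DB_READ_CEILING:
        return "read-heavy: add a cache or read replicas"
    return "write-heavy: optimize schema/indexes, plan for sharding"

print(tier2_decision(2_000, 500))    # single instance is enough
print(tier2_decision(50_000, 500))   # read-heavy: add a cache or read replicas
```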
Tier 3: Add Caching (10K - 100K reads/sec)
Setup: Redis or Memcached sits between the app tier and the database. Most reads are served from cache.
Bottleneck: Cache miss rate determines database load. At 95% hit rate, only 5% of reads reach the database. But if the cache fails, 100% of reads hit the DB (thundering herd).
Key decisions:
- What to cache (hot data, session data, computation results)
- TTL strategy (staleness tolerance drives TTL length)
- Invalidation approach (TTL-based, event-driven, or both)
Capacity math: Single Redis instance handles ~100K ops/sec. At 95% hit rate with 100K reads/sec total, Redis handles 95K and the DB handles 5K. Both are within single-instance capacity.
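The hit-rate arithmetic is worth having at your fingertips. A quick sketch of how total read traffic splits between cache and database:

```python
def split_read_load(total_reads_per_sec: float, hit_rate: float) -> tuple[float, float]:
    """Return (cache_reads_per_sec, db_reads_per_sec) at a given hit rate."""
    cache = total_reads_per_sec * hit_rate
    db = total_reads_per_sec - cache
    return cache, db

cache_qps, db_qps = split_read_load(100_000, 0.95)
print(cache_qps, db_qps)  # 95000.0 5000.0

# The failure mode: hit_rate collapses to 0 when the cache is down,
# so the DB suddenly absorbs all 100K reads/sec (thundering herd).
```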
When this tier isn't enough: When write traffic exceeds single-DB capacity, or when you need geographic distribution. Caching helps reads, not writes.
Tier 4: Horizontal App Scaling (100K+ req/sec)
Setup: Multiple stateless app servers behind a load balancer. Sessions stored externally (Redis). Shared-nothing architecture.
Bottleneck: The app tier is no longer the constraint (it scales linearly). The database, cache, and network become the limits.
Key decisions:
- Load balancing algorithm (round-robin is fine for stateless)
- Health checking (remove unhealthy instances quickly)
- Deployment strategy (rolling updates, blue-green, canary)
Capacity math: If each app server handles 2K req/sec, and you need 100K req/sec, you need 50 app server instances. Add 20% headroom for failures and spikes: 60 instances.
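The instance-count-plus-headroom calculation above can be sketched in a few lines (the 20% headroom figure is the article's assumption, not a universal constant):

```python
import math

def app_servers_needed(target_rps: float, per_server_rps: float,
                       headroom: float = 0.20) -> int:
    """Instances needed at target load, padded for failures and spikes."""
    base = target_rps / per_server_rps
    return math.ceil(base * (1 + headroom))

print(app_servers_needed(100_000, 2_000))        # 60
print(app_servers_needed(50_000, 2_000, 0.30))   # 33
```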
This tier is the easiest to scale because app servers are stateless. If you need 2x capacity tomorrow, you double the instances. The hard part is everything stateful: databases, caches, queues.
Tier 5: Database Sharding (when writes exceed single-DB capacity)
Setup: Data is partitioned across multiple database instances by a shard key (usually user_id, tenant_id, or geographic region).
Bottleneck: Cross-shard queries, shard imbalance (hot shards), and operational complexity (schema migrations across shards).
Key decisions:
- Shard key selection (determines data distribution and query patterns)
- Number of shards (start with the minimum needed, typically 4-16)
- Rebalancing strategy (consistent hashing makes adding shards less painful)
When to shard: When your single-primary write throughput exceeds capacity and you've already optimized queries, added proper indexes, and considered vertical scaling (bigger machine). Sharding is a last resort, not a first choice, for relational databases.
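The consistent-hashing idea from the rebalancing bullet can be sketched with a minimal hash ring. This is an illustrative toy, not production code: adding a shard only remaps the keys between it and its ring neighbors, instead of reshuffling nearly everything the way `hash(key) % num_shards` does.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes for smoother distribution."""

    def __init__(self, shards, vnodes: int = 64):
        self._ring = []  # sorted list of (point, shard)
        for shard in shards:
            for v in range(vnodes):
                self._ring.append((self._point(f"{shard}#{v}"), shard))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _point(key: str) -> int:
        # Stable hash of the key onto the ring (md5 for determinism, not security).
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect(self._points, self._point(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
print(ring.shard_for("user:12345"))  # stable per key, one of shard-0..shard-3
```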
Tier 6: Multi-Region (global users, extreme availability)
Setup: Full application stack deployed in 2-3+ regions. Data replicated across regions with eventual or causal consistency.
Bottleneck: Cross-region replication lag, data consistency, and conflict resolution.
Key decisions:
- Active-active vs. active-passive
- Which data gets replicated where
- Conflict resolution strategy (last-write-wins, CRDTs, manual resolution)
When to go multi-region: Two reasons only:
- Latency: Your users are global and 200ms cross-continent latency is unacceptable
- Availability: Your SLA requires surviving an entire region failure (99.99%+ uptime)
If neither applies, stay single-region. Multi-region adds enormous complexity for data consistency, deployment, and debugging. Don't do it for vanity.
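To make the conflict-resolution bullet concrete, here is a toy last-write-wins merge of two replicas. The data shape is hypothetical, and real systems use hybrid logical clocks rather than raw wall-clock timestamps to avoid clock-skew anomalies:

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Merge two replicas' {key: (timestamp, value)} maps; newest timestamp wins."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two regions concurrently updated the same cart; the later write wins.
us = {"cart:9": (100, ["book"])}
eu = {"cart:9": (105, ["book", "pen"])}
print(lww_merge(us, eu))  # {'cart:9': (105, ['book', 'pen'])}
```

Note the cost of this simplicity: the US write to `cart:9` is silently discarded, which is exactly why LWW is only acceptable when losing a concurrent update is tolerable.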
The Capacity Planning Decision Framework
For every component in your design, run through this checklist:
| Question | How to answer | Design decision |
|---|---|---|
| What's the read throughput? | From estimation: reads/sec | Need cache? Need read replicas? |
| What's the write throughput? | From estimation: writes/sec | Single primary enough? Need sharding? |
| What's the storage at Year 1? Year 5? | From estimation: size × volume × time | Fits in one DB? Need object storage? |
| What's the availability requirement? | From requirements: SLA % | Need replicas? Multi-region? |
| What's the latency requirement? | From requirements: p99 target | Need cache? Need CDN? Need local region? |
| What happens at 10x traffic? | Scale your estimates by 10 | Which tier ceiling do you hit first? |
The "10x" question is the one interviewers love. When someone asks "how would you handle 10x growth?", they want you to walk up one rung of the scaling ladder and explain what changes.
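The "which ceiling do you hit first" check is just a comparison against each component's limit. A sketch, using the same illustrative ceilings as the rest of this article (the current-load figures are example numbers):

```python
# Ballpark single-instance ceilings used throughout this article.
CEILINGS = {
    "db_reads": 10_000,    # reads/sec, single PostgreSQL
    "db_writes": 5_000,    # writes/sec, single primary
    "cache_ops": 100_000,  # ops/sec, single Redis
}

def bottlenecks_at(loads: dict, factor: float = 10.0) -> list[str]:
    """Return the components whose ceiling is exceeded at `factor`x traffic."""
    return [name for name, load in loads.items()
            if load * factor > CEILINGS[name]]

current = {"db_reads": 2_500, "db_writes": 500, "cache_ops": 47_500}
print(bottlenecks_at(current))  # ['db_reads', 'cache_ops'] break at 10x
```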
Interview tip: the 'what breaks first' technique
When presenting your design, proactively say: "At our current estimates, the system handles the load with this architecture. At 10x, the first bottleneck would be [X], and I'd solve it by [Y]. At 100x, the next bottleneck is [Z]." This demonstrates that you think about systems dynamically, not as a fixed point-in-time snapshot. It's one of the highest-signal behaviors in a staff-level interview.
Capacity Planning in Practice: Worked Example
Let's walk through a complete capacity planning exercise for an e-commerce product catalog.
Requirements (from Phase 1):
- 5M DAU, 50M MAU
- Users browse ~20 products per session, 2 sessions per day
- 10K new products added per day by merchants
- Product pages must load in under 200ms (p99)
Estimates (from Phase 2):
- Reads: 5M × 20 × 2 = 200M/day ÷ ~100K sec/day = 2,000 reads/sec
- Writes: 10K/day ÷ ~100K sec/day = 0.1 writes/sec (negligible)
- Read:Write ratio: 20,000:1 (extremely read-heavy)
- Storage: 10K products/day × 5KB per product × 365 × 5 = ~90 GB over 5 years
- Product images: 10K × 3 images × 200KB = 6 GB/day → ~11 TB over 5 years
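These estimates are easy to sanity-check in a script. The division by 100K treats a day as roughly 100K seconds, the standard back-of-envelope shortcut (the real figure is 86,400):

```python
SECONDS_PER_DAY = 100_000  # back-of-envelope shortcut; actual is 86,400

dau = 5_000_000
reads_per_day = dau * 20 * 2                  # 20 products/session, 2 sessions/day
reads_per_sec = reads_per_day / SECONDS_PER_DAY
writes_per_sec = 10_000 / SECONDS_PER_DAY     # 10K new products/day

catalog_gb = 10_000 * 5 * 365 * 5 / 1_000_000           # 5KB/product, 5 years
images_tb = 10_000 * 3 * 200 * 365 * 5 / 1_000_000_000  # 3 images x 200KB, 5 years

print(reads_per_sec)   # 2000.0
print(writes_per_sec)  # 0.1
print(catalog_gb)      # ~91 GB
print(images_tb)       # ~11 TB
```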
Capacity decisions:
| Component | Decision | Reasoning |
|---|---|---|
| Database | Single PostgreSQL, no sharding | 2K reads/sec and 0.1 writes/sec. A single instance handles 10K reads. Massive headroom. |
| Cache | Redis with product data | 200ms p99 requirement. DB reads take 5-10ms; cache hits return in roughly 1ms, leaving most of the latency budget for network and rendering. |
| Product images | S3 + CDN | 11 TB over 5 years. Object storage, not DB. CDN for global sub-50ms latency. |
| App servers | 2-3 instances | 2K req/sec / 1K per server = 2. Add 1 for availability. |
| Search | Elasticsearch | Product search queries can't be served by primary key lookups. Full-text search needs a dedicated index. |
At 10x growth (50M DAU):
- Reads jump to 20K/sec. Cache still handles it (far below Redis 100K ceiling).
- DB sees ~1K reads/sec on cache misses with 95% hit rate. Still within single-instance capacity.
- CDN egress grows 10x along with page views. CDNs absorb this natively; serving static images is their core workload.
- First real bottleneck: Elasticsearch if search queries grow proportionally.
At 100x growth (500M DAU):
- Reads: 200K/sec. Redis needs clustering (exceeds single-instance 100K ceiling).
- DB: 10K reads/sec on cache misses. At single-instance limit. Add read replicas.
- Now consider sharding PostgreSQL, but only if the product catalog exceeds memory (unlikely at 90GB).
For your interview: that "10x/100x" progression is exactly what you should present. It shows you can reason about growth, identify the right inflection points, and sequence your scaling investments.
How to Present Capacity Planning in an Interview
Here's the exact flow that works:
Step 1: State the bottleneck
"Based on our estimates of 50K reads/sec and 500 writes/sec, the read traffic is the primary scaling concern. A single database handles 10K reads/sec, so we need either caching or read replicas."
Step 2: Pick the solution and justify it
"I'll add a Redis cache with a 90-95% expected hit rate. At 95% hit rate, the database sees 2.5K reads/sec (5% of 50K). That's well within single-instance capacity. Redis handles 100K ops, so 47.5K cached reads is comfortable."
Step 3: State the next inflection point
"This architecture handles reads up to the single-instance Redis ceiling of ~100K/sec. I'd plan the move to a Redis cluster around 70K to keep headroom. For writes, 500/sec is fine on a single primary. I wouldn't consider sharding until writes exceed 5K/sec."
Step 4: Address failure scenarios
"If Redis goes down, 50K reads/sec hit the DB directly, which is 5x its capacity. I'd mitigate with: (1) local in-process cache as L1 with 60-second TTL, (2) circuit breaker that serves stale data during cache recovery, (3) auto-scaling read replicas that kick in when cache health degrades."
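The "only one request populates the cache" idea behind these mitigations can be sketched with a per-key lock. This is an illustrative single-process toy (class and function names are made up, not a real library); a distributed version would use a lock in Redis itself or a single-flight library:

```python
import threading

class SingleFlightCache:
    """Toy cache where concurrent misses for a key trigger exactly one load."""

    def __init__(self, loader):
        self._loader = loader          # the expensive DB fetch
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                      # only one thread loads per key
            if key not in self._cache:  # re-check after acquiring the lock
                self._cache[key] = self._loader(key)
        return self._cache[key]

calls = []
def slow_db_fetch(key):
    calls.append(key)                  # count how often the DB is actually hit
    return f"value-for-{key}"

cache = SingleFlightCache(slow_db_fetch)
threads = [threading.Thread(target=cache.get, args=("item-1",)) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(len(calls))  # 1 -- ten concurrent reads, one DB hit
```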
This four-step pattern (bottleneck → solution → next inflection → failure mode) works for any component at any scale. Practice it until it's automatic.
How This Shows Up in Interviews
When to do capacity planning
During Phase 5 (High-Level Architecture) as you add each major component. Don't save it for the end. Every time you draw a new box, briefly state why it's needed and what load it handles.
Common interviewer follow-ups
| Interviewer asks | Strong answer |
|---|---|
| "How many servers do you need?" | "At 50K req/sec with each server handling 2K req/sec, I need 25 app servers. Add 30% headroom: 33. With rolling deployments affecting 2 at a time, I'd provision 35." |
| "When would you shard?" | "When writes exceed 5K/sec on a single primary, and only after I've exhausted vertical scaling, query optimization, and CQRS. For this system, that's at roughly 10x current traffic." |
| "How would you handle a viral event?" | "The stateless app tier auto-scales. The CDN absorbs read amplification. The risk is the cache: if one viral item invalidates and all users request it simultaneously, we get thundering herd on the DB. I'd use a mutex lock pattern where only one request populates the cache and others wait." |
| "What's your infrastructure cost?" | "Rough estimate: 35 app servers at $200/mo = $7K, 3-node Redis cluster at $1.5K, RDS PostgreSQL at $2K, CDN at $3K for 10TB/mo, S3 at $500. Total: ~$14K/month. At 50K req/sec, that's roughly $0.11 per million requests." |
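The cost-per-request figure in that last row is worth deriving rather than memorizing. A quick check, assuming the 50K req/sec is sustained around the clock:

```python
monthly_cost = 7_000 + 1_500 + 2_000 + 3_000 + 500   # app + redis + rds + cdn + s3
req_per_month = 50_000 * 86_400 * 30                  # 50K req/sec, sustained

cost_per_million = monthly_cost / (req_per_month / 1_000_000)
print(monthly_cost)                 # 14000
print(round(cost_per_million, 2))   # 0.11 -- about $0.11 per million requests
```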
Interview tip: the cost estimate
Volunteering a rough infrastructure cost estimate (even if wrong by 2x) is a staff-level signal. It shows you think about systems as a whole, including the business dimension. You don't need exact AWS pricing. Ballpark it: "This architecture runs around $15K/month at our current scale."
Quick Recap
- Capacity planning translates estimates into infrastructure decisions. Every number from estimation should have an arrow pointing to a component choice.
- The scaling ladder has six tiers: single server → separate DB → caching → horizontal app → sharding → multi-region. Climb one rung at a time.
- Always identify the bottleneck first. Is it reads, writes, storage, latency, or availability? The bottleneck determines the next scaling technique.
- State inflection points explicitly: "This architecture handles up to X. Beyond that, we need Y." This is the highest-signal capacity planning behavior.
- Sharding is a last resort for relational databases. Exhaust vertical scaling, query optimization, and CQRS first.
- Multi-region is justified by two things only: global latency requirements or extreme availability SLAs.
- Volunteering a rough infrastructure cost estimate is a staff-level signal. Even a ballpark number shows you think about systems holistically.
Related Concepts
- Estimation - The numbers that feed into capacity planning. You can't plan capacity without first estimating traffic, storage, and bandwidth.
- Scalability - The underlying concept behind the scaling ladder. Vertical vs. horizontal scaling, stateless vs. stateful tiers.
- Sharding - A deep dive into Tier 5 of the scaling ladder. Shard key selection, rebalancing, and the operational cost of distributed data.
- Caching - Tier 3's primary tool. Cache-aside, write-through, and the thundering herd problem that makes cache failure a capacity planning concern.
- Replication - The mechanism behind read replicas and multi-region data distribution. Understanding replication lag is essential for capacity planning.