Back-of-envelope estimation
The 3-step estimation formula for any system design interview: numbers to memorize, the math that drives decisions, and shortcuts that save time.
TL;DR
- Estimation in interviews isn't about precision. It's about arriving at a number that drives a design decision within 2-3 minutes.
- Every estimation follows the same 3-step formula: Users → Actions → Resources. Start from DAU, convert to requests/second, then compute storage and bandwidth.
- Memorize the infrastructure ceilings: single PostgreSQL ~10K reads/sec, single Redis ~100K ops/sec, single app server ~1K-10K req/sec depending on request profile. When your traffic exceeds a ceiling, you need the next scaling strategy.
- The read-to-write ratio is the single most important number in any estimation. It determines whether you need a cache, read replicas, or neither.
- Round aggressively. 86,400 seconds in a day? Use 100,000. It's close, and the mental math is instant. Your interviewer cares that you know which numbers matter, not that you can divide by 86,400.
Why Estimation Matters
You're designing a URL shortener. Your teammate says: "Let's shard the database." But the system only handles 100 writes per second. A single PostgreSQL instance handles a few thousand writes per second. You've just added sharding complexity for no reason.
This is what happens without estimation. Engineers reach for sophisticated solutions because they sound impressive, not because the math demands them. Estimation is the filter that prevents your design from being either too simple (under-provisioned) or too complex (over-engineered).
I've seen candidates add Kafka, Redis, Cassandra, and a CDN to a system that processes 500 requests per second. That's a single Express.js server's workload. The interviewer's internal reaction: "This person will over-engineer everything they touch."
For your interview: estimation isn't a performance you put on. It's a tool that makes your design decisions defensible. When the interviewer asks "Why did you add a cache?", you say: "Because our read traffic is 500K/sec and a single database handles 10K reads/sec. Even with 5 read replicas, we're at 50K/sec. The cache absorbs the remaining 450K reads/sec at sub-millisecond latency." That's the difference between a hand-wavy design and an engineered one.
Estimation isn't math class
The number one mistake in estimation: spending 10 minutes on arithmetic. The interviewer doesn't care if your storage calculation is 4.2 TB or 5.1 TB. They care that you identified storage as a concern and arrived at "roughly 5 TB over 5 years." Round early, round aggressively, and spend your time on the design decisions the numbers enable.
The Numbers You Must Know
These are the constants of system design. Memorize them the way a pilot memorizes V-speeds. You'll use them in every interview.
Latency numbers
| Operation | Latency | Mental model |
|---|---|---|
| L1 cache reference | 0.5 ns | Instantaneous |
| L2 cache reference | 7 ns | Still CPU cache |
| RAM reference | 100 ns | Nanoseconds |
| SSD random read | 150 μs | Microseconds |
| HDD random read | 10 ms | Milliseconds (slow) |
| Same-datacenter round trip | 0.5 ms | Network hop |
| Cross-continent round trip | 150 ms | User-perceptible |
The key insight: every layer jump is roughly 10-100x slower. RAM to SSD: ~1,500x. SSD to HDD: ~67x. Local to cross-continent: ~300x. This is why caches exist at every layer.
Throughput ceilings (single instance)
| Component | Throughput | When you exceed this... |
|---|---|---|
| Web server (Node.js/Go) | 1K-10K req/sec | Add more instances behind LB |
| PostgreSQL (simple reads) | 10K queries/sec | Add read replicas or cache |
| PostgreSQL (writes) | 1K-5K writes/sec | Shard or switch to write-optimized DB |
| Redis | 100K ops/sec | Cluster mode (partition across nodes) |
| Kafka (per partition) | 10K-100K msgs/sec | Add partitions |
| Elasticsearch | 1K-10K queries/sec | Add shards |
| S3 | 3.5K PUT / 5.5K GET per sec per prefix | Distribute across prefixes |
I keep this table in my head during every design. When my estimated traffic for a component exceeds its ceiling, that's when I introduce the next scaling technique. Not before. This prevents over-engineering.
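This check is mechanical enough to sketch in code. A toy helper, where the ceiling values are the rough single-instance numbers from the table above (real ceilings vary widely with hardware, query shape, and configuration; the names here are mine, not a standard API):

```python
# Illustrative single-instance ceilings (req/sec), taken from the
# throughput table above. Treat these as order-of-magnitude guides.
CEILINGS = {
    "postgres_reads": 10_000,
    "postgres_writes": 5_000,
    "redis": 100_000,
    "web_server": 10_000,
}

def needs_scaling(component: str, traffic_per_sec: float) -> bool:
    """True when estimated traffic exceeds the single-instance ceiling."""
    return traffic_per_sec > CEILINGS[component]

# URL shortener at 100 writes/sec: one PostgreSQL primary, no sharding.
print(needs_scaling("postgres_writes", 100))      # False
# 500K reads/sec: far past one instance; add replicas and a cache.
print(needs_scaling("postgres_reads", 500_000))   # True
```

The point is not the code, it's the discipline: introduce the next scaling technique only when the estimate crosses the threshold.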
Storage and size constants
| Data | Size | Notes |
|---|---|---|
| UUID | 16 bytes | 36 chars as string |
| Timestamp | 8 bytes | Unix epoch |
| Average tweet/post text | ~300 bytes | After encoding |
| Photo (compressed) | 200 KB - 2 MB | JPEG varies by resolution |
| Video (1 min, compressed) | 10-50 MB | Depends on codec/quality |
| 1 million integers | ~4 MB | 4 bytes each |
| 1 billion rows × 1 KB | ~1 TB | Common DB sizing |
Useful conversion factors
| Conversion | Value | Shortcut |
|---|---|---|
| Seconds in a day | 86,400 | Use ~100K (10^5) |
| Seconds in a month | ~2.5M | Use ~2.5 × 10^6 |
| Seconds in a year | ~31.5M | Use ~3 ร 10^7 |
| 1 MB/sec sustained | ~2.5 TB/month | Useful for bandwidth costs |
| 2^10 | 1,024 | ~1 thousand (K) |
| 2^20 | ~1M | ~1 million (M) |
| 2^30 | ~1B | ~1 billion (G/Giga) |
| 2^40 | ~1T | ~1 trillion (T) |
The 3-Step Estimation Formula
Every back-of-envelope calculation follows the same structure. Once you internalize this, you can estimate any system in 3 minutes.
Step 1: Traffic (Users → Requests/second)
Start from your Daily Active Users (DAU), which you locked down in Phase 2 (Non-Functional Requirements).
Reads per second = (DAU ร reads_per_user_per_day) / 100,000
Writes per second = (DAU ร writes_per_user_per_day) / 100,000
(We use 100K instead of 86,400 because it makes mental math instant and the error margin is under 15%, which is irrelevant for design decisions.)
Example: Instagram-like photo sharing
- DAU: 10M
- Each user views feed 5 times/day (10 photos each = 50 reads)
- Each user uploads 0.1 photos/day (1 in 10 users posts daily)
Reads/sec = (10M × 50) / 100K = 5,000 reads/sec
Writes/sec = (10M × 0.1) / 100K = 10 writes/sec
Read:Write ratio = 500:1
That 500:1 ratio immediately tells you: this is a read-heavy system. Your primary scaling concern is reads, not writes. A cache layer will have massive impact.
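Step 1 as runnable arithmetic, a minimal sketch (the `traffic` helper and the rounded `SECONDS_PER_DAY` constant are illustrative choices, not a standard library):

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400; error ~14%, irrelevant here

def traffic(dau: int, reads_per_user: float, writes_per_user: float):
    """Step 1: convert daily active users into requests/second."""
    reads_per_sec = dau * reads_per_user / SECONDS_PER_DAY
    writes_per_sec = dau * writes_per_user / SECONDS_PER_DAY
    return reads_per_sec, writes_per_sec

# Instagram-like example: 10M DAU, 50 reads and 0.1 uploads per user/day
reads, writes = traffic(10_000_000, 50, 0.1)
print(reads, writes, reads / writes)  # 5000.0 10.0 500.0
```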
Step 2: Storage (Data per object × Volume × Time horizon)
Daily storage = writes_per_day × size_per_object
Storage at Year 5 = daily_storage × 365 × 5
Example continued:
- 10M × 0.1 = 1M photos/day
- Average photo: 500 KB compressed
- Daily: 1M × 500 KB = 500 GB/day
- 5-year total: 500 GB × 365 × 5 = ~900 TB ≈ 1 PB
At 1 PB, you're in object storage territory (S3). No relational database holds this. This estimate just drove a design decision: photos go in S3, metadata goes in the database.
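The same storage math in code, a sketch under the article's assumptions (1M photos/day at 500 KB each; the helper name and decimal unit constants are mine):

```python
def storage_bytes(writes_per_day: float, size_per_object: float, years: int):
    """Step 2: daily storage and total over a time horizon, in bytes."""
    daily = writes_per_day * size_per_object
    return daily, daily * 365 * years

KB, GB, TB = 1e3, 1e9, 1e12  # decimal units are fine at this precision

daily, five_year = storage_bytes(1_000_000, 500 * KB, 5)
print(daily / GB)      # 500.0 (GB/day)
print(five_year / TB)  # 912.5 (TB over 5 years, i.e. ~1 PB)
```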
Step 3: Bandwidth (Data transfer per second)
Read bandwidth = reads_per_sec × response_size
Write bandwidth = writes_per_sec × request_size
Example continued:
- 5,000 reads/sec × 500 KB photo = 2.5 GB/sec outbound
- That's 20 Gbps on the wire, which is significant. This justifies a CDN: serving 2.5 GB/sec from origin servers is expensive and slow for global users. A CDN absorbs 90%+ of this.
Putting it together
| Metric | Value | Design decision |
|---|---|---|
| Read traffic | 5K reads/sec | Cache layer (Redis) absorbs most |
| Write traffic | 10 writes/sec | Single DB primary, no sharding needed |
| Read:Write ratio | 500:1 | Read-optimized architecture |
| Storage (5yr) | ~1 PB | Object storage (S3) for photos |
| Bandwidth | 2.5 GB/sec | CDN required |
Five lines of math that justify five architectural decisions. That's the power of estimation.
Interview tip: connect every number to a decision
Never compute a number without immediately stating what it means for the design. "5,000 reads/sec" by itself is trivia. "5,000 reads/sec, which means a single PostgreSQL instance can handle it but we'd want a cache for sub-ms latency" is engineering.
Estimation Shortcuts for Common Systems
You don't need to do full estimation from scratch every time. These patterns cover 80% of interview questions.
Social media (Twitter, Instagram, Facebook)
- Read:Write = 100:1 to 1000:1
- DAU: 10M-500M
- Key insight: feed generation is the scaling bottleneck, not storage
- Design implication: aggressive caching + fanout strategy decision
Messaging (WhatsApp, Slack, Discord)
- Read:Write = 1:1 (every message is written once, read by recipients)
- Messages/day: DAU × 40-100 messages per user
- Key insight: connection management (WebSockets) is the bottleneck
- Design implication: state management for millions of persistent connections
E-commerce (Amazon, Shopify)
- Read:Write = 100:1 (browsing vs buying)
- Order conversion: 2-5% of sessions
- Key insight: cart and checkout are write-heavy but low-volume; catalog is read-heavy and high-volume
- Design implication: separate scaling strategies for catalog (cache) vs orders (ACID DB)
Video streaming (YouTube, Netflix)
- Storage: massive (10M videos × 500 MB average = 5 PB)
- Bandwidth: the primary cost driver (1M concurrent streams × 5 Mbps = 5 Tbps)
- Key insight: bandwidth costs dominate; storage is cheap but delivery is expensive
- Design implication: CDN with adaptive bitrate streaming
Common Estimation Mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Spending 10+ minutes on math | Wastes design time | Cap estimation at 5 minutes. Round aggressively. |
| Computing storage without a time horizon | "500 GB" means nothing without timeline | Always state: "X per day, Y over 5 years" |
| Ignoring read:write ratio | Treating all traffic as equal | Split reads and writes first. The ratio drives your architecture. |
| Using peak traffic for everything | Over-provisions the entire system | Estimate average, then note peak is 3-5x. Design for peak but size for average. |
| Estimating bandwidth but not acting on it | Computing numbers without connecting to decisions | Every bandwidth > 1 Gbps = you need a CDN. Period. |
| Forgetting metadata overhead | Photo is 500KB but you still need DB rows | Estimate data store and metadata store separately |
How This Shows Up in Interviews
When to estimate
Estimation is a tool, not a standalone phase. Pull it out during Phase 2 (Non-Functional Requirements) to set scale targets, and during Phase 5 (High-Level Architecture) to justify component choices. The numbers from estimation inform every infrastructure decision.
The signals interviewers look for
| Signal | What it looks like |
|---|---|
| Good: estimates drive decisions | "At 50K reads/sec, we need a cache. Here's why: PostgreSQL handles 10K." |
| Good: rounds to simplify math | "86,400 seconds, call it 100K. Close enough, makes the math instant." |
| Good: splits reads and writes | "Our read:write ratio is 100:1, so this is a read-heavy system." |
| Bad: estimates are decorative | Computes numbers, then designs without referencing them |
| Bad: false precision | "We need 4.217 TB of storage." Nobody needs 3 decimal places. |
| Bad: estimates everything | Computes storage for logs, metrics, backups. Only estimate what matters. |
Common interviewer follow-ups
| Interviewer asks | Strong answer |
|---|---|
| "How did you get that number?" | Show the chain: DAU → actions → requests/sec. Clear, reproducible. |
| "What if traffic is 10x higher?" | "At 10x, our 5K reads/sec becomes 50K. The cache still handles it (Redis does 100K ops/sec). At a ~99% hit rate, the DB sees only 500 reads/sec on misses, still fine. The bottleneck shifts to bandwidth: 25 GB/sec needs a CDN with multiple edge PoPs." |
| "Is that storage estimate realistic?" | "It's order-of-magnitude correct. In production I'd add 30% overhead for indexes, replicas, and tombstones. But for design purposes, '5 TB' vs '6.5 TB' doesn't change the architecture." |
Interview tip: say your rounding out loud
When you round 86,400 to 100,000 or 2.6M to 3M, say it: "I'm rounding up to keep the math simple. The error is under 15% and won't affect the architecture." This signals mathematical literacy and pragmatism. Both are positive signals.
Quick Recap
- Every estimation follows three steps: traffic (users to req/sec), storage (size × volume × time), bandwidth (req/sec × payload size).
- Memorize infrastructure ceilings: PostgreSQL 10K reads, Redis 100K ops, app server 1-10K req/sec depending on request profile. These are the decision thresholds.
- Always split read and write traffic. The ratio drives your entire architecture.
- Round aggressively (86,400 → 100K) and say it out loud. Precision is a waste of interview time.
- Connect every number to a design decision. An estimate without a consequence is decoration.
- Peak traffic is 3-5x average. Design for peak, size infrastructure for average with auto-scaling.
- For video/media platforms, bandwidth is the primary cost driver, not storage. For text platforms, storage and compute dominate.
Related Concepts
- Approach & Structure - The 6-phase framework that estimation plugs into. Use estimation inside Phase 2 (NFRs) and Phase 5 (Architecture) to justify decisions with numbers.
- Capacity Planning - Takes your estimates and translates them into infrastructure decisions: server counts, shard counts, replica counts.
- Scalability - The concept your estimates are sizing for. Understanding vertical vs. horizontal scaling determines which ceiling matters.
- Caching - The first component justified by estimation. When reads exceed DB capacity, caching is the answer.
- Databases - Understanding database throughput ceilings is half of the estimation skill.