๐Ÿ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Interview Framework

Back-of-envelope estimation

The 3-step estimation formula for any system design interview: numbers to memorize, the math that drives decisions, and shortcuts that save time.

24 min read · 2026-03-27 · medium · tags: estimation, back-of-envelope, capacity, interview-framework, hld

TL;DR

  • Estimation in interviews isn't about precision. It's about arriving at a number that drives a design decision within 2-3 minutes.
  • Every estimation follows the same 3-step formula: Users → Actions → Resources. Start from DAU, convert to requests/second, then compute storage and bandwidth.
  • Memorize the infrastructure ceilings: single PostgreSQL ~10K reads/sec, single Redis ~100K ops/sec, single app server ~1K-10K req/sec depending on request profile. When your traffic exceeds a ceiling, you need the next scaling strategy.
  • The read-to-write ratio is the single most important number in any estimation. It determines whether you need a cache, read replicas, or neither.
  • Round aggressively. 86,400 seconds in a day? Use 100,000. It's close, and the mental math is instant. Your interviewer cares that you know which numbers matter, not that you can divide by 86,400.

Why Estimation Matters

You're designing a URL shortener. Your teammate says: "Let's shard the database." But the system only handles 100 writes per second, and a single PostgreSQL instance handles a few thousand. You've just added sharding complexity for no reason.

This is what happens without estimation. Engineers reach for sophisticated solutions because they sound impressive, not because the math demands them. Estimation is the filter that prevents your design from being either too simple (under-provisioned) or too complex (over-engineered).

I've seen candidates add Kafka, Redis, Cassandra, and a CDN to a system that processes 500 requests per second. That's a single Express.js server's workload. The interviewer's internal reaction: "This person will over-engineer everything they touch."

For your interview: estimation isn't a performance you put on. It's a tool that makes your design decisions defensible. When the interviewer asks "Why did you add a cache?", you say: "Because our read traffic is 500K/sec and a single database handles 10K reads/sec. Even with 5 read replicas, we're at 50K/sec. The cache absorbs the remaining 450K reads/sec at sub-millisecond latency." That's the difference between a hand-wavy design and an engineered one.

Estimation isn't math class

The number one mistake in estimation: spending 10 minutes on arithmetic. The interviewer doesn't care if your storage calculation is 4.2 TB or 5.1 TB. They care that you identified storage as a concern and arrived at "roughly 5 TB over 5 years." Round early, round aggressively, and spend your time on the design decisions the numbers enable.


The Numbers You Must Know

These are the constants of system design. Memorize them the way a pilot memorizes V-speeds. You'll use them in every interview.

[Image: reference card with latency numbers, throughput ceilings, and storage constants in three columns]
Your estimation reference card. The infrastructure ceilings (middle column) are the numbers that drive design decisions: when traffic exceeds a ceiling, you need the next scaling tier.

Latency numbers

| Operation | Latency | Mental model |
|---|---|---|
| L1 cache reference | 0.5 ns | Instantaneous |
| L2 cache reference | 7 ns | Still CPU cache |
| RAM reference | 100 ns | Nanoseconds |
| SSD random read | 150 μs | Microseconds |
| HDD random read | 10 ms | Milliseconds (slow) |
| Same-datacenter round trip | 0.5 ms | Network hop |
| Cross-continent round trip | 150 ms | User-perceptible |

The key insight: each layer jump costs one to several orders of magnitude. RAM to SSD: ~1,500x. SSD to HDD: ~67x. Local to cross-continent: ~300x. This is why caches exist at every layer.
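To make the jumps concrete, here's a small Python sketch (the constants and function name are mine) that derives those ratios from the table above:

```python
# Rough latency constants in nanoseconds, from the table above.
LATENCY_NS = {
    "l1_cache": 0.5,
    "l2_cache": 7,
    "ram": 100,
    "ssd_random_read": 150_000,        # 150 μs
    "hdd_random_read": 10_000_000,     # 10 ms
    "same_dc_round_trip": 500_000,     # 0.5 ms
    "cross_continent": 150_000_000,    # 150 ms
}

def slowdown(faster: str, slower: str) -> float:
    """How many times slower the second layer is than the first."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]

print(slowdown("ram", "ssd_random_read"))                 # 1500.0
print(slowdown("ssd_random_read", "hdd_random_read"))     # ~67
print(slowdown("same_dc_round_trip", "cross_continent"))  # 300.0
```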

Throughput ceilings (single instance)

| Component | Throughput | When you exceed this... |
|---|---|---|
| Web server (Node.js/Go) | 1K-10K req/sec | Add more instances behind LB |
| PostgreSQL (simple reads) | 10K queries/sec | Add read replicas or cache |
| PostgreSQL (writes) | 1K-5K writes/sec | Shard or switch to write-optimized DB |
| Redis | 100K ops/sec | Cluster mode (partition across nodes) |
| Kafka (per partition) | 10K-100K msgs/sec | Add partitions |
| Elasticsearch | 1K-10K queries/sec | Add shards |
| S3 | 3.5K PUT/sec, 5.5K GET/sec per prefix | Distribute across prefixes |

I keep this table in my head during every design. When my estimated traffic for a component exceeds its ceiling, that's when I introduce the next scaling technique. Not before. This prevents over-engineering.
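As an illustration of that habit, here's a hypothetical helper (names and structure are mine; ceilings are taken from the conservative end of each range) that turns the table into a decision rule:

```python
# Single-instance ceilings (conservative end of each range) and the
# next scaling move when a component exceeds its ceiling.
CEILINGS = {
    "web_server": (1_000, "add instances behind a load balancer"),
    "postgres_reads": (10_000, "add read replicas or a cache"),
    "postgres_writes": (1_000, "shard or use a write-optimized store"),
    "redis": (100_000, "move to cluster mode"),
}

def next_move(component: str, estimated_per_sec: int) -> str:
    """Compare estimated traffic against a ceiling and name the next step."""
    ceiling, fix = CEILINGS[component]
    if estimated_per_sec <= ceiling:
        return f"{component}: {estimated_per_sec}/sec fits under {ceiling}/sec, no change"
    return f"{component}: {estimated_per_sec}/sec exceeds {ceiling}/sec -> {fix}"

print(next_move("postgres_writes", 100))    # fits: no sharding needed
print(next_move("postgres_reads", 50_000))  # exceeds: replicas or cache
```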

Storage and size constants

| Data | Size | Notes |
|---|---|---|
| UUID | 16 bytes | 36 chars as string |
| Timestamp | 8 bytes | Unix epoch |
| Average tweet/post text | ~300 bytes | After encoding |
| Photo (compressed) | 200 KB - 2 MB | JPEG varies by resolution |
| Video (1 min, compressed) | 10-50 MB | Depends on codec/quality |
| 1 million integers | ~4 MB | 4 bytes each |
| 1 billion rows × 1 KB | ~1 TB | Common DB sizing |

Useful conversion factors

| Conversion | Value | Shortcut |
|---|---|---|
| Seconds in a day | 86,400 | Use ~100K (10^5) |
| Seconds in a month | ~2.6M | Use ~2.5 × 10^6 |
| Seconds in a year | ~31.5M | Use ~3 × 10^7 |
| 1 MB/sec sustained | ~2.5 TB/month | Useful for bandwidth costs |
| 2^10 | 1,024 | ~1 thousand (K) |
| 2^20 | ~1M | ~1 million (M) |
| 2^30 | ~1B | ~1 billion (G/giga) |
| 2^40 | ~1T | ~1 trillion (T) |
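If you're skeptical of the 100K shortcut, a quick check (using an Instagram-style workload: 10M DAU, 50 reads per user per day) shows the error it introduces stays within the tolerance design decisions need:

```python
# Comparing the exact seconds-per-day divisor against the ~100K shortcut.
SECONDS_PER_DAY = 86_400
APPROX = 100_000

dau, reads_per_user = 10_000_000, 50
exact = dau * reads_per_user / SECONDS_PER_DAY   # ~5,787 reads/sec
quick = dau * reads_per_user / APPROX            # 5,000 reads/sec

error = (exact - quick) / exact
print(f"Shortcut understates traffic by {error:.1%}")  # ~13.6% -- under 15%
```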

The 3-Step Estimation Formula

Every back-of-envelope calculation follows the same structure. Once you internalize this, you can estimate any system in 3 minutes.

Step 1: Traffic (Users → Requests/second)

Start from your Daily Active Users (DAU), which you locked down in Phase 2 (Non-Functional Requirements).

Reads per second = (DAU ร— reads_per_user_per_day) / 100,000
Writes per second = (DAU ร— writes_per_user_per_day) / 100,000

(We use 100K instead of 86,400 because it makes mental math instant and the error margin is under 15%, which is irrelevant for design decisions.)

Example: Instagram-like photo sharing

  • DAU: 10M
  • Each user views feed 5 times/day (10 photos each = 50 reads)
  • Each user uploads 0.1 photos/day (1 in 10 users posts daily)
Reads/sec = (10M × 50) / 100K = 5,000 reads/sec
Writes/sec = (10M × 0.1) / 100K = 10 writes/sec
Read:Write ratio = 500:1

That 500:1 ratio immediately tells you: this is a read-heavy system. Your primary scaling concern is reads, not writes. A cache layer will have massive impact.
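This step packages neatly into a tiny Python function — a sketch, with the function name and signature my own — fed with the example's numbers:

```python
def traffic_estimate(dau: int, reads_per_user: float, writes_per_user: float):
    """Step 1: convert daily active users into requests/second.

    Uses the ~100K seconds/day shortcut from the article.
    """
    seconds_per_day = 100_000
    reads_per_sec = dau * reads_per_user / seconds_per_day
    writes_per_sec = dau * writes_per_user / seconds_per_day
    return reads_per_sec, writes_per_sec, reads_per_sec / writes_per_sec

# Instagram-like example: 10M DAU, 50 feed reads and 0.1 uploads per user per day
reads, writes, ratio = traffic_estimate(10_000_000, 50, 0.1)
print(f"{reads:.0f} reads/sec, {writes:.0f} writes/sec, {ratio:.0f}:1")
# 5000 reads/sec, 10 writes/sec, 500:1
```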

Step 2: Storage (Data per object × Volume × Time horizon)

Daily storage = writes_per_day × size_per_object
Storage at Year 5 = daily_storage × 365 × 5

Example continued:

  • 10M × 0.1 = 1M photos/day
  • Average photo: 500 KB compressed
  • Daily: 1M × 500 KB = 500 GB/day
  • 5-year total: 500 GB × 365 × 5 = ~900 TB ≈ 1 PB

At 1 PB, you're in object storage territory (S3). No relational database holds this. This estimate just drove a design decision: photos go in S3, metadata goes in the database.
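A quick sketch of this step in Python (helper name and unit constants are mine), using decimal units for mental-math friendliness:

```python
def storage_estimate(writes_per_day: float, bytes_per_object: float, years: int):
    """Step 2: size per object x volume x time horizon, in bytes."""
    daily = writes_per_day * bytes_per_object
    return daily, daily * 365 * years

KB, GB, TB = 1_000, 1_000_000_000, 1_000_000_000_000

# 1M photos/day at 500 KB each, over 5 years
daily, five_year = storage_estimate(1_000_000, 500 * KB, 5)
print(daily / GB, "GB/day")        # 500.0 GB/day
print(five_year / TB, "TB total")  # 912.5 TB -- round to ~1 PB
```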

Step 3: Bandwidth (Data transfer per second)

Read bandwidth = reads_per_sec × response_size
Write bandwidth = writes_per_sec × request_size

Example continued:

  • 5,000 reads/sec × 500 KB per photo = 2.5 GB/sec outbound
  • That's 20 Gbps, which is significant. This justifies a CDN: serving 2.5 GB/sec from origin servers is expensive and slow for global users. A CDN absorbs 90%+ of this.
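The same step in code (a sketch; names are mine). Note the ×8 conversion from bytes/sec to bits/sec, since links and CDN capacity are rated in bits:

```python
def bandwidth_estimate(requests_per_sec: float, payload_bytes: float):
    """Step 3: requests/sec x payload size -> (bytes/sec, bits/sec)."""
    bytes_per_sec = requests_per_sec * payload_bytes
    return bytes_per_sec, bytes_per_sec * 8  # network links are rated in bits

KB, GB = 1_000, 1_000_000_000

# 5,000 feed reads/sec, each serving a 500 KB photo
out_bytes, out_bits = bandwidth_estimate(5_000, 500 * KB)
print(out_bytes / GB, "GB/sec")  # 2.5 GB/sec
print(out_bits / GB, "Gbps")     # 20.0 Gbps -- CDN territory
```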

Putting it together

| Metric | Value | Design decision |
|---|---|---|
| Read traffic | 5K reads/sec | Cache layer (Redis) absorbs most |
| Write traffic | 10 writes/sec | Single DB primary, no sharding needed |
| Read:Write ratio | 500:1 | Read-optimized architecture |
| Storage (5 yr) | ~1 PB | Object storage (S3) for photos |
| Bandwidth | 2.5 GB/sec | CDN required |

Five lines of math that justify five architectural decisions. That's the power of estimation.

Interview tip: connect every number to a decision

Never compute a number without immediately stating what it means for the design. "5,000 reads/sec" by itself is trivia. "5,000 reads/sec, which means a single PostgreSQL instance can handle it but we'd want a cache for sub-ms latency" is engineering.


Estimation Shortcuts for Common Systems

You don't need to do full estimation from scratch every time. These patterns cover 80% of interview questions.

Social media (Twitter, Instagram, Facebook)

Read:Write = 100:1 to 1000:1
DAU: 10M-500M
Key insight: feed generation is the scaling bottleneck, not storage
Design implication: aggressive caching + fanout strategy decision

Messaging (WhatsApp, Slack, Discord)

Read:Write = 1:1 (every message is written once, read by recipients)
Messages/day: DAU × 40-100 messages per user
Key insight: connection management (WebSockets) is the bottleneck
Design implication: state management for millions of persistent connections

E-commerce (Amazon, Shopify)

Read:Write = 100:1 (browsing vs buying)
Order conversion: 2-5% of sessions
Key insight: cart and checkout are write-heavy but low-volume; catalog is read-heavy high-volume
Design implication: separate scaling strategies for catalog (cache) vs orders (ACID DB)

Video streaming (YouTube, Netflix)

Storage: massive (10M videos × 500 MB average = 5 PB)
Bandwidth: the primary cost driver (1M concurrent streams × 5 Mbps = 5 Tbps)
Key insight: bandwidth costs dominate. Storage is cheap but delivery is expensive
Design implication: CDN with adaptive bitrate streaming
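Plugging the shortcut's numbers into code confirms the scale (decimal unit constants are mine):

```python
# Sanity-checking the video-streaming shortcut numbers.
MB, PB = 1_000_000, 1_000_000_000_000_000

storage = 10_000_000 * 500 * MB   # 10M videos x 500 MB average
print(storage / PB, "PB")         # 5.0 PB

Mbps, Tbps = 1_000_000, 1_000_000_000_000
egress = 1_000_000 * 5 * Mbps     # 1M concurrent streams x 5 Mbps
print(egress / Tbps, "Tbps")      # 5.0 Tbps -- delivery dwarfs storage cost
```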

Common Estimation Mistakes

| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Spending 10+ minutes on math | Wastes design time | Cap estimation at 5 minutes. Round aggressively. |
| Computing storage without a time horizon | "500 GB" means nothing without a timeline | Always state: "X per day, Y over 5 years" |
| Ignoring read:write ratio | Treats all traffic as equal | Split reads and writes first. The ratio drives your architecture. |
| Using peak traffic for everything | Over-provisions the entire system | Estimate average, then note peak is 3-5x. Design for peak but size for average. |
| Estimating bandwidth but not acting on it | Computes numbers without connecting them to decisions | Any bandwidth over ~1 Gbps means you need a CDN. Period. |
| Forgetting metadata overhead | A photo is 500 KB, but you still need DB rows | Estimate the data store and metadata store separately |

How This Shows Up in Interviews

When to estimate

Estimation is a tool, not a standalone phase. Pull it out during Phase 2 (Non-Functional Requirements) to set scale targets, and during Phase 5 (High-Level Architecture) to justify component choices. The numbers from estimation inform every infrastructure decision.

The signals interviewers look for

| Signal | What it looks like |
|---|---|
| Good: estimates drive decisions | "At 50K reads/sec, we need a cache. Here's why: PostgreSQL handles 10K." |
| Good: rounds to simplify math | "86,400 seconds, call it 100K. Close enough, makes the math instant." |
| Good: splits reads and writes | "Our read:write ratio is 100:1, so this is a read-heavy system." |
| Bad: estimates are decorative | Computes numbers, then designs without referencing them |
| Bad: false precision | "We need 4.217 TB of storage." Nobody needs 3 decimal places. |
| Bad: estimates everything | Computes storage for logs, metrics, backups. Only estimate what matters. |

Common interviewer follow-ups

| Interviewer asks | Strong answer |
|---|---|
| "How did you get that number?" | Show the chain: DAU → actions → requests/sec. Clear, reproducible. |
| "What if traffic is 10x higher?" | "At 10x, our 5K reads/sec becomes 50K. The cache still handles it (Redis does 100K ops/sec). At a ~99% hit rate, the DB sees only ~500 reads/sec on misses, still fine. The bottleneck shifts to bandwidth: 25 GB/sec needs a CDN with multiple edge PoPs." |
| "Is that storage estimate realistic?" | "It's order-of-magnitude correct. In production I'd add 30% overhead for indexes, replicas, and tombstones. But for design purposes, '5 TB' vs '6.5 TB' doesn't change the architecture." |

Interview tip: say your rounding out loud

When you round 86,400 to 100,000 or 2.6M to 3M, say it: "I'm rounding up to keep the math simple. The error is under 15% and won't affect the architecture." This signals mathematical literacy and pragmatism. Both are positive signals.



Quick Recap

  1. Every estimation follows three steps: traffic (users to req/sec), storage (size × volume × time), bandwidth (req/sec × payload size).
  2. Memorize infrastructure ceilings: PostgreSQL 10K reads, Redis 100K ops, app server 1-10K req/sec depending on request profile. These are the decision thresholds.
  3. Always split read and write traffic. The ratio drives your entire architecture.
  4. Round aggressively (86,400 → 100K) and say it out loud. Precision is a waste of interview time.
  5. Connect every number to a design decision. An estimate without a consequence is decoration.
  6. Peak traffic is 3-10x average. Design for peak, size infrastructure for average with auto-scaling.
  7. For video/media platforms, bandwidth is the primary cost driver, not storage. For text platforms, storage and compute dominate.

Related Concepts

  • Approach & Structure - The 6-phase framework that estimation plugs into. Use estimation inside Phase 2 (NFRs) and Phase 5 (Architecture) to justify decisions with numbers.
  • Capacity Planning - Takes your estimates and translates them into infrastructure decisions: server counts, shard counts, replica counts.
  • Scalability - The concept your estimates are sizing for. Understanding vertical vs. horizontal scaling determines which ceiling matters.
  • Caching - The first component justified by estimation. When reads exceed DB capacity, caching is the answer.
  • Databases - Understanding database throughput ceilings is half of the estimation skill.
