Back-of-envelope estimation
The 3-step estimation formula for any system design interview: numbers to memorize, the math that drives decisions, and shortcuts that save time.
TL;DR
- Estimation in interviews isn't about precision. It's about arriving at a number that drives a design decision within 2-3 minutes.
- Every estimation follows the same 3-step formula: Users → Actions → Resources. Start from DAU, convert to requests/second, then compute storage and bandwidth.
- Memorize the infrastructure ceilings: single PostgreSQL ~10K reads/sec, single Redis ~100K ops/sec, single app server ~1K-10K req/sec depending on request profile. When your traffic exceeds a ceiling, you need the next scaling strategy.
- The read-to-write ratio is the single most important number in any estimation. It determines whether you need a cache, read replicas, or neither.
- Round aggressively. 86,400 seconds in a day? Use 100,000. It's close, and the mental math is instant. Your interviewer cares that you know which numbers matter, not that you can divide by 86,400.
Why Estimation Matters
You're designing a URL shortener. Your teammate says: "Let's shard the database." But the system only handles 100 writes per second. A single PostgreSQL instance handles a few thousand writes per second. You've just added sharding complexity for no reason.
This is what happens without estimation. Engineers reach for sophisticated solutions because they sound impressive, not because the math demands them. Estimation is the filter that prevents your design from being either too simple (under-provisioned) or too complex (over-engineered).
I've seen candidates add Kafka, Redis, Cassandra, and a CDN to a system that processes 500 requests per second. That's a single Express.js server's workload. The interviewer's internal reaction: "This person will over-engineer everything they touch."
For your interview: estimation isn't a performance you put on. It's a tool that makes your design decisions defensible. When the interviewer asks "Why did you add a cache?", you say: "Because our read traffic is 500K/sec and a single database handles 10K reads/sec. Even with 5 read replicas, we're at 50K/sec. The cache absorbs the remaining 450K reads/sec at sub-millisecond latency." That's the difference between a hand-wavy design and an engineered one.
Estimation isn't math class
The number one mistake in estimation: spending 10 minutes on arithmetic. The interviewer doesn't care if your storage calculation is 4.2 TB or 5.1 TB. They care that you identified storage as a concern and arrived at "roughly 5 TB over 5 years." Round early, round aggressively, and spend your time on the design decisions the numbers enable.
The Numbers You Must Know
These are the constants of system design. Memorize them the way a pilot memorizes V-speeds. You'll use them in every interview.
Latency numbers
| Operation | Latency | Mental model |
|---|---|---|
| L1 cache reference | 0.5 ns | Instantaneous |
| L2 cache reference | 7 ns | Still CPU cache |
| RAM reference | 100 ns | Nanoseconds |
| SSD random read | 150 μs | Microseconds |
| HDD random read | 10 ms | Milliseconds (slow) |
| Same-datacenter round trip | 0.5 ms | Network hop |
| Cross-continent round trip | 150 ms | User-perceptible |
The key insight: every layer jump is roughly 10-100x slower. RAM to SSD: ~1,500x. SSD to HDD: ~67x. Local to cross-continent: ~300x. This is why caches exist at every layer.
Throughput ceilings (single instance)
| Component | Throughput | When you exceed this... |
|---|---|---|
| Web server (Node.js/Go) | 1K-10K req/sec | Add more instances behind LB |
| PostgreSQL (simple reads) | 10K queries/sec | Add read replicas or cache |
| PostgreSQL (writes) | 1K-5K writes/sec | Shard or switch to write-optimized DB |
| Redis | 100K ops/sec | Cluster mode (partition across nodes) |
| Kafka (per partition) | 10K-100K msgs/sec | Add partitions |
| Elasticsearch | 1K-10K queries/sec | Add shards |
| S3 | 3.5K PUT / 5.5K GET per sec per prefix | Distribute across prefixes |
I keep this table in my head during every design. When my estimated traffic for a component exceeds its ceiling, that's when I introduce the next scaling technique. Not before. This prevents over-engineering.
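This check is mechanical enough to sketch in code. A toy helper, where the ceiling values are the rough single-instance numbers from the table above (real ceilings vary widely with hardware, query shape, and configuration; the names here are mine, not a standard API):

```python
# Illustrative single-instance ceilings (req/sec), taken from the
# throughput table above. Treat these as order-of-magnitude guides.
CEILINGS = {
    "postgres_reads": 10_000,
    "postgres_writes": 5_000,
    "redis": 100_000,
    "web_server": 10_000,
}

def needs_scaling(component: str, traffic_per_sec: float) -> bool:
    """True when estimated traffic exceeds the single-instance ceiling."""
    return traffic_per_sec > CEILINGS[component]

# URL shortener at 100 writes/sec: one PostgreSQL primary, no sharding.
print(needs_scaling("postgres_writes", 100))      # False
# 500K reads/sec: far past one instance; add replicas and a cache.
print(needs_scaling("postgres_reads", 500_000))   # True
```

The point is not the code, it's the discipline: introduce the next scaling technique only when the estimate crosses the threshold.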
Storage and size constants
| Data | Size | Notes |
|---|---|---|
| UUID | 16 bytes | 36 chars as string |
| Timestamp | 8 bytes | Unix epoch |
| Average tweet/post text | ~300 bytes | After encoding |
| Photo (compressed) | 200 KB - 2 MB | JPEG varies by resolution |
| Video (1 min, compressed) | 10-50 MB | Depends on codec/quality |
| 1 million integers | ~4 MB | 4 bytes each |
| 1 billion rows × 1 KB | ~1 TB | Common DB sizing |
Useful conversion factors
| Conversion | Value | Shortcut |
|---|---|---|
| Seconds in a day | 86,400 | Use ~100K (10^5) |
| Seconds in a month | ~2.5M | Use ~2.5 × 10^6 |
| Seconds in a year | ~31.5M | Use ~3 ร 10^7 |
| 1 MB/sec sustained | ~2.5 TB/month | Useful for bandwidth costs |
| 2^10 | 1,024 | ~1 thousand (K) |
| 2^20 | ~1M | ~1 million (M) |
| 2^30 | ~1B | ~1 billion (G/Giga) |
| 2^40 | ~1T | ~1 trillion (T) |
The 3-Step Estimation Formula
Every back-of-envelope calculation follows the same structure. Once you internalize this, you can estimate any system in 3 minutes.
Step 1: Traffic (Users → Requests/second)
Start from your Daily Active Users (DAU), which you locked down in Phase 2 (Non-Functional Requirements).
Reads per second = (DAU ร reads_per_user_per_day) / 100,000
Writes per second = (DAU ร writes_per_user_per_day) / 100,000
(We use 100K instead of 86,400 because it makes mental math instant and the error margin is under 15%, which is irrelevant for design decisions.)
Example: Instagram-like photo sharing
- DAU: 10M
- Each user views feed 5 times/day (10 photos each = 50 reads)
- Each user uploads 0.1 photos/day (1 in 10 users posts daily)
Reads/sec = (10M × 50) / 100K = 5,000 reads/sec
Writes/sec = (10M × 0.1) / 100K = 10 writes/sec
Read:Write ratio = 500:1
That 500:1 ratio immediately tells you: this is a read-heavy system. Your primary scaling concern is reads, not writes. A cache layer will have massive impact.
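Step 1 as runnable arithmetic, a minimal sketch (the `traffic` helper and the rounded `SECONDS_PER_DAY` constant are illustrative choices, not a standard library):

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400; error ~14%, irrelevant here

def traffic(dau: int, reads_per_user: float, writes_per_user: float):
    """Step 1: convert daily active users into requests/second."""
    reads_per_sec = dau * reads_per_user / SECONDS_PER_DAY
    writes_per_sec = dau * writes_per_user / SECONDS_PER_DAY
    return reads_per_sec, writes_per_sec

# Instagram-like example: 10M DAU, 50 reads and 0.1 uploads per user/day
reads, writes = traffic(10_000_000, 50, 0.1)
print(reads, writes, reads / writes)  # 5000.0 10.0 500.0
```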
Step 2: Storage (Data per object × Volume × Time horizon)
Daily storage = writes_per_day × size_per_object
Storage at Year 5 = daily_storage × 365 × 5
Example continued:
- 10M × 0.1 = 1M photos/day
- Average photo: 500 KB compressed
- Daily: 1M × 500 KB = 500 GB/day
- 5-year total: 500 GB × 365 × 5 = ~900 TB ≈ 1 PB
At 1 PB, you're in object storage territory (S3). No relational database holds this. This estimate just drove a design decision: photos go in S3, metadata goes in the database.
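The same storage math in code, a sketch under the article's assumptions (1M photos/day at 500 KB each; the helper name and decimal unit constants are mine):

```python
def storage_bytes(writes_per_day: float, size_per_object: float, years: int):
    """Step 2: daily storage and total over a time horizon, in bytes."""
    daily = writes_per_day * size_per_object
    return daily, daily * 365 * years

KB, GB, TB = 1e3, 1e9, 1e12  # decimal units are fine at this precision

daily, five_year = storage_bytes(1_000_000, 500 * KB, 5)
print(daily / GB)      # 500.0 (GB/day)
print(five_year / TB)  # 912.5 (TB over 5 years, i.e. ~1 PB)
```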
Step 3: Bandwidth (Data transfer per second)
Read bandwidth = reads_per_sec × response_size
Write bandwidth = writes_per_sec × request_size
Example continued:
- 5,000 reads/sec × 500 KB photo = 2.5 GB/sec outbound
- That's 20 Gbps on the wire, which is significant. This justifies a CDN: serving 2.5 GB/sec from origin servers is expensive and slow for global users. A CDN absorbs 90%+ of this.
Putting it together
| Metric | Value | Design decision |
|---|---|---|
| Read traffic | 5K reads/sec | Cache layer (Redis) absorbs most |
| Write traffic | 10 writes/sec | Single DB primary, no sharding needed |
| Read:Write ratio | 500:1 | Read-optimized architecture |
| Storage (5yr) | ~1 PB | Object storage (S3) for photos |
| Bandwidth | 2.5 GB/sec | CDN required |
Five lines of math that justify five architectural decisions. That's the power of estimation.
Interview tip: connect every number to a decision
Never compute a number without immediately stating what it means for the design. "5,000 reads/sec" by itself is trivia. "5,000 reads/sec, which means a single PostgreSQL instance can handle it but we'd want a cache for sub-ms latency" is engineering.
Estimation Shortcuts for Common Systems
You don't need to do full estimation from scratch every time. These patterns cover 80% of interview questions.
Social media (Twitter, Instagram, Facebook)
- Read:Write = 100:1 to 1000:1
- DAU: 10M-500M
- Key insight: feed generation is the scaling bottleneck, not storage
- Design implication: aggressive caching + fanout strategy decision
Messaging (WhatsApp, Slack, Discord)
- Read:Write = 1:1 (every message is written once, read by recipients)
- Messages/day: DAU × 40-100 messages per user
- Key insight: connection management (WebSockets) is the bottleneck
- Design implication: state management for millions of persistent connections
E-commerce (Amazon, Shopify)
- Read:Write = 100:1 (browsing vs buying)
- Order conversion: 2-5% of sessions
- Key insight: cart and checkout are write-heavy but low-volume; catalog is read-heavy and high-volume
- Design implication: separate scaling strategies for catalog (cache) vs orders (ACID DB)
Video streaming (YouTube, Netflix)
- Storage: massive (10M videos × 500 MB average = 5 PB)
- Bandwidth: the primary cost driver (1M concurrent streams × 5 Mbps = 5 Tbps)
- Key insight: bandwidth costs dominate; storage is cheap but delivery is expensive
- Design implication: CDN with adaptive bitrate streaming
Common Estimation Mistakes
| Mistake | Why it's wrong | What to do instead |
|---|---|---|
| Spending 10+ minutes on math | Wastes design time | Cap estimation at 5 minutes. Round aggressively. |
| Computing storage without a time horizon | "500 GB" means nothing without timeline | Always state: "X per day, Y over 5 years" |
| Ignoring read:write ratio | Treating all traffic as equal | Split reads and writes first. The ratio drives your architecture. |
| Using peak traffic for everything | Over-provisions the entire system | Estimate average, then note peak is 3-5x. Design for peak but size for average. |
| Estimating bandwidth but not acting on it | Computing numbers without connecting to decisions | Every bandwidth > 1 Gbps = you need a CDN. Period. |
| Forgetting metadata overhead | Photo is 500KB but you still need DB rows | Estimate data store and metadata store separately |
How This Shows Up in Interviews
When to estimate
Estimation is a tool, not a standalone phase. Pull it out during Phase 2 (Non-Functional Requirements) to set scale targets, and during Phase 5 (High-Level Architecture) to justify component choices. The numbers from estimation inform every infrastructure decision.
The signals interviewers look for
| Signal | What it looks like |
|---|---|
| Good: estimates drive decisions | "At 50K reads/sec, we need a cache. Here's why: PostgreSQL handles 10K." |
| Good: rounds to simplify math | "86,400 seconds, call it 100K. Close enough, makes the math instant." |
| Good: splits reads and writes | "Our read:write ratio is 100:1, so this is a read-heavy system." |
| Bad: estimates are decorative | Computes numbers, then designs without referencing them |
| Bad: false precision | "We need 4.217 TB of storage." Nobody needs 3 decimal places. |
| Bad: estimates everything | Computes storage for logs, metrics, backups. Only estimate what matters. |
Common interviewer follow-ups
| Interviewer asks | Strong answer |
|---|---|
| "How did you get that number?" | Show the chain: DAU → actions → requests/sec. Clear, reproducible. |
| "What if traffic is 10x higher?" | "At 10x, our 5K reads/sec becomes 50K. The cache still handles it (Redis does 100K ops/sec). At a ~99% hit rate, the DB sees only 500 reads/sec on misses, still fine. The bottleneck shifts to bandwidth: 25 GB/sec needs a CDN with multiple edge PoPs." |
| "Is that storage estimate realistic?" | "It's order-of-magnitude correct. In production I'd add 30% overhead for indexes, replicas, and tombstones. But for design purposes, '5 TB' vs '6.5 TB' doesn't change the architecture." |
Interview tip: say your rounding out loud
When you round 86,400 to 100,000 or 2.6M to 3M, say it: "I'm rounding up to keep the math simple. The error is under 15% and won't affect the architecture." This signals mathematical literacy and pragmatism. Both are positive signals.
Quick Recap
- Every estimation follows three steps: traffic (users to req/sec), storage (size × volume × time), bandwidth (req/sec × payload size).
- Memorize infrastructure ceilings: PostgreSQL 10K reads, Redis 100K ops, app server 1-10K req/sec depending on request profile. These are the decision thresholds.
- Always split read and write traffic. The ratio drives your entire architecture.
- Round aggressively (86,400 → 100K) and say it out loud. Precision is a waste of interview time.
- Connect every number to a design decision. An estimate without a consequence is decoration.
- Peak traffic is 3-5x average. Design for peak, size infrastructure for average with auto-scaling.
- For video/media platforms, bandwidth is the primary cost driver, not storage. For text platforms, storage and compute dominate.
Related Concepts
- Approach & Structure - The 6-phase framework that estimation plugs into. Use estimation inside Phase 2 (NFRs) and Phase 5 (Architecture) to justify decisions with numbers.
- Capacity Planning - Takes your estimates and translates them into infrastructure decisions: server counts, shard counts, replica counts.
- Scalability - The concept your estimates are sizing for. Understanding vertical vs. horizontal scaling determines which ceiling matters.
- Caching - The first component justified by estimation. When reads exceed DB capacity, caching is the answer.
- Databases - Understanding database throughput ceilings is half of the estimation skill.