Handling scale questions
A repeatable framework for when the interviewer says "now scale it to 10x": how to identify bottlenecks, pick the right scaling strategy, and communicate the trade-offs.
TL;DR
- When the interviewer says "now scale it to 10x," they're testing systematic thinking, not knowledge of scaling tools.
- Use the 3-step scale response: identify the bottleneck, calculate the threshold where it breaks, and apply a targeted strategy.
- The scaling toolkit has six core tools: caching, read replicas, connection pooling, sharding, CDN, and message queues. Each one addresses a specific bottleneck with known capacity ranges.
- Match symptoms to bottlenecks: high read latency means cache or replicas, high write latency means queues or sharding, storage growth means tiering, bandwidth means CDN.
- The strongest signal you can send is knowing when NOT to scale. "Vertical scaling handles this load for the next two years" shows better judgment than reflexively sharding.
The "Scale It" Moment
You've just presented a clean architecture for a URL shortener. The interviewer nods, pauses, and says: "Great. Now how would you handle 100,000 requests per second?"
This is the moment most candidates fumble. They panic and start throwing buzzwords: "We'd add sharding, and Kafka, and a CDN, and Redis, and maybe Kubernetes auto-scaling." They add everything at once without identifying what actually breaks first.
I've watched interviewers mark this as a fail even when the candidate names the right tools. Why? Because listing tools isn't the skill. The skill is systematic diagnosis: what breaks, at what threshold, and what specifically fixes it. A doctor who prescribes every medication in the pharmacy isn't a good doctor, even if the right medication is in the list.
The interviewer isn't testing whether you know what Redis is. They're testing whether you can reason from first principles about where systems fail under load and apply precisely the right amount of complexity to fix it.
The all-at-once scaling trap
Adding caching, sharding, replicas, queues, and a CDN simultaneously in response to "scale it" is a red flag. It signals you don't know which component is actually the bottleneck. Always name the bottleneck first, then apply exactly one solution to address it.
The 3-Step Scale Response
Every scale question gets the same framework. I've used this in interviews on both sides of the table, and it works every time.
Step 1: Identify the Bottleneck
Before you add anything, ask: which component in your current design fails first at the new scale?
The answer is almost always one of four things:
- Database reads (too many queries per second for a single instance)
- Database writes (single primary can't absorb the write volume)
- Application compute (CPU-bound or memory-bound on app servers)
- Network bandwidth (too much data flowing through a single path)
Name it explicitly: "At 100K requests per second, the bottleneck is the database read path. Our single PostgreSQL instance handles about 10K reads per second, so we need a 10x read capacity improvement."
That sentence alone puts you ahead of 80% of candidates.
Step 2: Calculate the Threshold
Back-of-envelope math turns vague scale into concrete breaking points. This is where numbers matter.
For a URL shortener at 100K req/sec:
- Read/write ratio: ~100:1 (99K reads, 1K writes per second)
- Single PostgreSQL instance: handles ~10K reads/sec
- We need ~99K reads/sec, so we're 10x over capacity on reads
- Single PostgreSQL primary: handles ~10K writes/sec
- We need ~1K writes/sec, so writes are fine
This math takes 30 seconds and tells you exactly where to focus: reads are the bottleneck, writes are not. You don't need write sharding. You need read capacity.
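The arithmetic above is simple enough to sketch as a few lines of code. This is a minimal back-of-envelope check using the capacity figures assumed in this article (the ~10K reads/sec and writes/sec per PostgreSQL instance are rough interview numbers, not benchmarks):

```python
# Back-of-envelope bottleneck check for the URL shortener at 100K req/sec.
# Capacity figures are this article's rough assumptions, not benchmarks.

TOTAL_RPS = 100_000
READ_FRACTION = 0.99               # ~100:1 read/write split
PG_READ_CAPACITY = 10_000          # assumed reads/sec for one instance
PG_WRITE_CAPACITY = 10_000         # assumed writes/sec for one primary

reads = TOTAL_RPS * READ_FRACTION          # ~99,000 reads/sec
writes = TOTAL_RPS - reads                 # ~1,000 writes/sec

# Anything over 1.0x means that path is the bottleneck.
print(f"reads:  {reads:,.0f}/sec -> {reads / PG_READ_CAPACITY:.1f}x capacity")
print(f"writes: {writes:,.0f}/sec -> {writes / PG_WRITE_CAPACITY:.1f}x capacity")
```

The output makes the diagnosis explicit: reads are ~10x over capacity, writes are at a tenth of it. That is the one sentence you say out loud before naming any tool.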
Step 3: Apply a Targeted Strategy
Now, and only now, you name the specific scaling tool.
"Since reads are the bottleneck, I'll add a Redis cache with an expected 95% hit rate. That reduces database reads from 99K/sec to about 5K/sec, which one PostgreSQL instance handles easily. If we need more capacity beyond that, I'll add two read replicas to handle the remaining 5K reads, giving us about 25K reads/sec in headroom."
That's the complete scale response: bottleneck identified, threshold calculated, tool applied with expected impact quantified.
Interview tip: say the number before the tool
"We need 10x read capacity, so I'll add a cache" is stronger than "I'll add a cache for reads." The number justifies the tool. Without it, every scaling decision sounds arbitrary.
The Scaling Toolkit
Here are the six core scaling tools, when each applies, and the rough capacity numbers you can cite in an interview. You don't need to memorize exact figures, but knowing the order of magnitude matters.
Caching (Redis / Memcached)
When it applies: Read-heavy workloads where the same data is requested repeatedly. Timeline reads, product catalog pages, user profiles.
Threshold trigger: Database read latency increasing, or reads exceeding ~10K/sec on a single instance.
Capacity: Redis handles ~100K to 500K reads/sec per node. A 95% cache hit rate reduces database load by 20x.
What to say: "I'm adding Redis as a cache-aside layer. With a 95% hit rate on timeline reads, database load drops from 99K reads/sec to about 5K, well within a single instance's capacity."
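The cache-aside pattern mentioned above fits in a few lines. This sketch uses a plain dict as a stand-in for Redis so it runs anywhere; in production you would swap in a Redis client and set a TTL on write (the function names and schema are illustrative):

```python
# Cache-aside read path sketch. A plain dict stands in for Redis here;
# a real deployment would use a Redis client with a TTL on each set.

cache = {}          # stand-in for Redis
db_reads = 0        # counter showing how few reads reach the database

def db_lookup(short_code):
    """Stand-in for a database query (assumed schema: short_code -> URL)."""
    global db_reads
    db_reads += 1
    return f"https://example.com/{short_code}"

def resolve(short_code):
    url = cache.get(short_code)        # 1. try the cache first
    if url is None:                    # 2. on a miss, hit the database...
        url = db_lookup(short_code)
        cache[short_code] = url        # 3. ...and populate the cache
    return url

# Repeated lookups for the same short code hit the database only once.
for _ in range(1000):
    resolve("abc123")
print(db_reads)  # 1
```

The design choice worth narrating: cache-aside keeps the database as the source of truth, so a cache wipe degrades latency but never correctness.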
Read Replicas
When it applies: Read-heavy workloads where caching alone isn't sufficient, or where data freshness requirements limit cache TTL.
Threshold trigger: Even after caching, database read load exceeds single-instance capacity. Or you need geographic read locality.
Capacity: Each PostgreSQL replica handles ~10K reads/sec. Adding 3 replicas gives ~30K reads/sec total read capacity.
What to say: "For cache misses and queries that need fresh data, I'm adding three read replicas. Replication lag is typically under 100ms, which is acceptable for eventually-consistent reads like timeline loads."
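Read replicas only help if the application routes traffic to them. A minimal routing sketch, assuming a statement-type split (the connection strings are illustrative placeholders; real ORMs and proxies do this more robustly):

```python
import itertools

# Read/write routing sketch: writes go to the primary, reads round-robin
# across replicas. DSNs below are illustrative placeholders.

PRIMARY = "postgres://primary:5432/app"
REPLICAS = [
    "postgres://replica-1:5432/app",
    "postgres://replica-2:5432/app",
    "postgres://replica-3:5432/app",
]
_replica_cycle = itertools.cycle(REPLICAS)

def route(query: str) -> str:
    """Route a SQL statement by type: SELECTs to a replica, else primary."""
    is_read = query.lstrip().upper().startswith("SELECT")
    return next(_replica_cycle) if is_read else PRIMARY

print(route("SELECT url FROM links WHERE code = 'abc'"))  # some replica
print(route("INSERT INTO links VALUES ('abc', '...')"))   # the primary
```

One caveat to mention in the interview: reads that must see a just-committed write (read-your-own-writes) should be pinned to the primary, because replicas can lag.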
Connection Pooling (PgBouncer)
When it applies: Many app server instances each opening database connections, exhausting the connection limit.
Threshold trigger: PostgreSQL connection limit (typically ~200 to 500 max connections) exceeded by the number of app server instances.
Capacity: PgBouncer in transaction mode multiplexes ~1,000 app connections into ~100 real PostgreSQL connections.
What to say: "With 50 app server instances each opening 20 connections, we'd need 1,000 database connections. PostgreSQL tops out at ~500. I'll add PgBouncer in transaction mode to multiplex connections."
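The connection math in that answer is worth doing explicitly. A quick sketch using the article's assumed figures (50 servers, 20 connections each, a ~500-connection PostgreSQL limit):

```python
# Connection math behind the PgBouncer example. All figures are this
# article's assumptions, not universal limits.

app_servers = 50
conns_per_server = 20
pg_max_connections = 500            # assumed server-side limit

demanded = app_servers * conns_per_server   # 1,000 client connections
print(demanded > pg_max_connections)        # True -> pooling needed

# PgBouncer in transaction mode multiplexes many client connections
# onto a small server-side pool:
pool_size = 100
multiplex_ratio = demanded / pool_size      # 10 clients per real connection
print(multiplex_ratio)  # 10.0
```

Transaction mode works here because each client holds a real connection only for the duration of a transaction; session-level features like prepared statements need extra care, which is a trade-off worth naming.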
Write Sharding
When it applies: Write volume exceeds single-primary capacity. Or storage size exceeds single-instance disk limits.
Threshold trigger: Single PostgreSQL primary exceeds ~10K writes/sec, or total dataset exceeds ~2 to 5 TB (where vacuum and index maintenance become painful).
Capacity: Each shard is an independent primary. 4 shards give 4x write capacity (~40K writes/sec) and 4x storage capacity.
What to say: "At 50K writes per second, a single primary can't keep up. I'll shard by user_id mod 8, giving 8 independent primaries that each handle about 6K writes/sec with headroom."
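The "shard by user_id mod 8" routing from that answer can be sketched directly (the shard DSNs are illustrative placeholders):

```python
# Hash-based shard routing sketch for the "user_id mod 8" example.
# Shard connection strings are illustrative placeholders.

NUM_SHARDS = 8
SHARDS = [f"postgres://shard-{i}:5432/app" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    """All rows for one user land on one shard, so single-user queries
    never cross shards; cross-user queries must fan out to all 8."""
    return SHARDS[user_id % NUM_SHARDS]

print(shard_for(42))        # postgres://shard-2:5432/app
print(shard_for(42 + 8))    # same shard: routing is stable per user
```

The trade-off to name: mod-N routing makes adding a ninth shard painful, because almost every key remaps. Consistent hashing or directory-based sharding reduces that resharding cost, at the price of more moving parts.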
CDN (Content Delivery Network)
When it applies: High-bandwidth static content (images, videos, CSS/JS bundles) consuming origin server bandwidth and increasing latency for geographically distributed users.
Threshold trigger: Origin server bandwidth approaching network limits, or P95 latency high for users far from the data center.
Capacity: A CDN absorbs ~95% to 99% of static content requests at the edge. If 10,000 users request the same image, the origin sees one request per cache TTL per edge location, not 10,000.
What to say: "Profile images and media are high-bandwidth, low-change content. I'll serve them through a CDN. With a 5-minute TTL, a viral tweet's image generates one origin request per 5 minutes per edge location, regardless of how many users view it."
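The key insight in that answer is that origin load depends on TTL and edge-location count, not viewer count. A quick sanity check (the 200-edge-location figure is an illustrative assumption about the CDN's footprint):

```python
# Origin load math for the CDN example: with edge caching, origin traffic
# depends on TTL and edge-location count, not on how many users view it.

viewers_per_min = 10_000        # requests for one viral image
edge_locations = 200            # illustrative CDN footprint
ttl_minutes = 5

# Without a CDN, every viewer hits the origin:
origin_rps_no_cdn = viewers_per_min / 60

# With a CDN, each edge location refetches at most once per TTL window:
origin_requests_per_min = edge_locations / ttl_minutes   # 40 per minute
origin_rps_with_cdn = origin_requests_per_min / 60

print(f"{origin_rps_no_cdn:.1f} -> {origin_rps_with_cdn:.2f} origin req/sec")
```

Note the counterintuitive consequence: the with-CDN origin rate is a ceiling that doesn't change whether the image goes 10x or 100x more viral.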
Message Queues (Kafka / SQS)
When it applies: Synchronous operations that are too slow to include in the API response path. Fan-out, notification delivery, analytics ingestion, email sending.
Threshold trigger: API response time increasing because the app server is doing work that could be deferred. Or write path needs to distribute work across many consumers.
Capacity: A Kafka broker handles on the order of ~100K to 1M messages/sec, scaling further with partitions and brokers. SQS scales effectively infinitely for standard queues.
What to say: "Fan-out to followers is O(N) work per tweet. For a user with 100K followers, that's 100K writes. Doing this synchronously would make the POST /tweet response take seconds. I'll publish a fan-out event to Kafka and let worker consumers handle the timeline writes asynchronously."
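The fan-out-on-write pattern in that answer can be sketched in-process. A deque stands in for Kafka here so the example runs anywhere; a real system publishes the event to a topic and runs the worker as a separate consumer process (all names are illustrative):

```python
from collections import deque

# Fan-out-on-write sketch. A deque stands in for Kafka; in production the
# event is published to a topic and workers consume it asynchronously.

queue = deque()
timelines = {}   # follower_id -> list of tweet ids

def post_tweet(author_id, tweet_id, followers):
    """Fast path: enqueue one event and return immediately, instead of
    doing O(N) timeline writes inside the request."""
    queue.append({"author": author_id, "tweet": tweet_id,
                  "followers": followers})

def fanout_worker():
    """Slow path: a worker drains events and does the O(N) writes."""
    while queue:
        event = queue.popleft()
        for follower in event["followers"]:
            timelines.setdefault(follower, []).append(event["tweet"])

post_tweet(author_id=1, tweet_id=101, followers=range(2, 1002))
fanout_worker()
print(len(timelines))  # 1000 followers received the tweet
```

The latency win is the whole point: POST /tweet now does O(1) work, and the O(N) cost moves off the response path, where it can be retried and rate-limited independently.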
Matching Symptoms to Bottlenecks
In an interview, the interviewer's phrasing often hints at the bottleneck. Here's a quick reference for mapping symptoms to strategies.
| Symptom | Likely Bottleneck | First Strategy | Second Strategy |
|---|---|---|---|
| High read latency | DB overloaded with reads | Redis cache (95% hit) | Read replicas |
| High write latency | Single primary saturated | Message queue (defer work) | Write sharding |
| Connection errors | DB connection limit hit | PgBouncer pooling | Horizontal app scale |
| Storage growing fast | Single disk limit | Archive/tiered storage | Write sharding |
| High bandwidth usage | Large static assets | CDN edge caching | Compression (gzip/brotli) |
| Spiky traffic patterns | Insufficient headroom | Auto-scaling app tier | Pre-warming cache |
My recommendation: when you hear the symptom, name the bottleneck explicitly before proposing a fix. "That sounds like a database read bottleneck" is the bridge sentence that shows systematic thinking.
Worked Example: Scaling a URL Shortener from 100 to 100K RPS
Let's walk through a concrete scaling exercise. This is the kind of progression you'd narrate in an interview.
Starting Point: 100 requests/sec
At 100 RPS, a single app server and single PostgreSQL instance handle everything. No caching, no replicas, no queues. The architecture is dead simple.
Read/write ratio: ~100:1 (99 reads/sec for redirects, 1 write/sec for URL creation). PostgreSQL handles this without breaking a sweat.
Scale to 1,000 RPS
Reads: 990/sec. Still within a single PostgreSQL instance's capacity (~10K reads/sec). Writes: 10/sec. No problem.
The only change I'd consider: adding a second app server behind a load balancer for availability (not performance). No caching needed yet.
This is the "when NOT to add complexity" answer that impresses interviewers. Saying "this is fine without optimization" shows judgment.
Scale to 10,000 RPS
Reads: 9,900/sec. Now we're approaching the single-instance limit.
"I'm adding a Redis cache for URL lookups. Short URLs rarely change after creation, so the cache hit rate should be 99%+. That reduces database reads from 9,900/sec to about 100/sec."
Writes: 100/sec. Still fine on one primary.
Scale to 100,000 RPS
Reads: 99,000/sec. Even with 99% cache hit rate, that's 990 cache misses per second hitting the database. Manageable, but let's add a read replica for safety.
Writes: 1,000/sec. One primary handles this fine (~10K writes/sec capacity).
The real question at this scale: Redis. A single Redis node handles ~500K reads/sec, so one node covers the load. But we need high availability, so a three-node setup (one primary plus two replicas, with automatic failover via Redis Sentinel) makes sense.
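The 100K-RPS step's arithmetic fits in a few lines. This sketch uses the article's assumed figures (99% hit rate, ~500K reads/sec per Redis node):

```python
# The 100K RPS step's arithmetic in one place. Hit rate and capacity
# figures are this article's assumptions, not benchmarks.

total_rps = 100_000
reads = 99_000                    # ~99% of traffic is redirects
writes = total_rps - reads        # 1,000 creations/sec

hit_rate = 0.99
db_reads = round(reads * (1 - hit_rate))  # 990 misses/sec reach PostgreSQL

redis_capacity = 500_000          # assumed single-node reads/sec
# Ceiling division: nodes needed purely for load (HA needs more).
redis_nodes = -(-reads // redis_capacity)

print(db_reads)      # 990
print(redis_nodes)   # 1
```

The two printed numbers are the narration: the database sees under 1K reads/sec, and the extra Redis nodes exist for availability, not throughput.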
Notice how each jump adds exactly one new capability. We didn't shard the database at 100K RPS because the write volume doesn't justify it. This restraint is the signal.
The "What If It's 10x More?" Ladder
Interviewers love asking successive "what about 10x more?" questions. Here's the general progression for a read-heavy web service:
100 RPS: Single server, single database. No optimization needed.
1K RPS: Add a second app server and a load balancer for availability.
10K RPS: Add caching (Redis). Most reads hit cache. Database load stays low.
100K RPS: Add read replicas for cache misses. Add more app servers. Use connection pooling.
1M RPS: Shard the database (if write volume demands it). Add a CDN for static assets. Move async work to message queues with workers.
10M RPS: Multi-region deployment. Regional caching. Geographic load balancing. This is where you start discussing cell-based architecture and cross-region replication.
For your interview: you rarely need to go beyond the 1M RPS level unless the interviewer specifically asks about multi-region. Most design interview scales are 10K to 100K RPS.
When NOT to Scale
This is the section most candidates don't know they need. Knowing when not to add complexity is as important as knowing how to add it.
Vertical scaling is underrated. A single modern database server (64 cores, 256GB RAM, NVMe SSDs) handles surprisingly high throughput. If your calculated write volume is 5K writes/sec and vertical scaling handles 10K, say: "At this volume, vertical scaling is sufficient. I'd recommend a larger instance over introducing sharding complexity."
Premature sharding is a well-known anti-pattern. Sharding adds cross-shard query complexity, transaction limitations, resharding operational burden, and a wider blast radius for schema migrations. Don't introduce it unless your back-of-envelope math shows you need it.
Caching everything isn't free. Cache invalidation is one of the two hard problems in computer science. If the data changes frequently and consistency matters, a cache creates a staleness window. Sometimes a read replica with 100ms replication lag is simpler and more consistent than a cache with complex invalidation logic.
I'll often tell candidates: "The interviewer is more impressed when you say 'we don't need sharding at this scale' than when you add sharding unnecessarily. It shows you understand the cost."
Interview tip: name the cost of scaling
Every scaling strategy has a cost. Caching adds staleness. Sharding adds operational complexity. Replicas add replication lag. When you propose a strategy, name the trade-off: "I'm adding read replicas, which introduces up to 100ms replication lag for reads. For timeline data, that's acceptable."
Common Scaling Mistakes
Wrong bottleneck. The candidate adds read replicas when the actual problem is write throughput. Or adds a cache for data that changes on every request (cache hit rate near 0%). Always identify the bottleneck before proposing a solution.
Scaling everything simultaneously. "I'll add a cache, 5 read replicas, shard the database, add Kafka, and use a CDN." This laundry list approach suggests you can't prioritize. Scale one bottleneck at a time.
No numbers. "We add more servers" without calculating how many or what throughput improvement that gives. Always quantify: "Each app server handles 5K RPS, so for 100K RPS we need 20 instances."
Premature optimization. The interviewer said "1,000 users" and you're designing for 100 million. Match your architecture to the stated scale. If they want you to think bigger, they'll ask.
Forgetting the write path. Candidates obsess over read scaling (caching, replicas) and ignore write bottlenecks entirely. For write-heavy systems (IoT sensor data, analytics ingestion, financial transactions), write sharding and message queues are the primary scaling tools.
Ignoring connection limits. Even with caching and replicas, 100 app servers each opening 20 database connections add up to 2,000 connections. PostgreSQL's default max is 100. Connection pooling is essential at scale and often forgotten.
How This Shows Up in Interviews
"How would you scale this to 10x?" This is the classic phrasing. Use the 3-step framework: identify bottleneck, calculate threshold, apply strategy. Do the math out loud.
"What's the first thing that breaks?" Name the specific component and the specific limit. "The PostgreSQL primary's write throughput. A single instance handles about 10K writes per second, and at this scale we need 50K."
"Can this handle Black Friday traffic?" Traffic spikes are about headroom and elasticity. Talk about auto-scaling app servers, pre-warming caches, and over-provisioning the database layer (because databases can't auto-scale as quickly as stateless services).
"What's the cost of this scaling approach?" This tests whether you understand trade-offs. Caching costs memory and adds staleness. Sharding costs operational complexity and limits cross-shard queries. Replicas cost money and add replication lag. Name the specific trade-off.
"Do we really need this?" The interviewer is testing judgment. If vertical scaling suffices, say so. If the stated requirements don't justify the added complexity, push back respectfully: "At the stated 1K writes per second, a single primary is sufficient. I'd only introduce sharding if write volume exceeds 10K per second."
Quick Recap
- Use the 3-step scale response for every "scale it" question: identify the bottleneck, calculate the threshold, apply a targeted strategy. This framework works for any scale question, regardless of the system.
- Always do back-of-envelope math out loud. "We need 99K reads/sec, the database handles 10K, so we need 10x read capacity" justifies the solution and shows structured thinking.
- Match symptoms to bottlenecks systematically. High read latency means cache or replicas. High write latency means queue or shard. Don't guess, diagnose.
- Name the cost of every scaling decision. Caching adds staleness, sharding adds operational complexity, replicas add replication lag. Trade-off awareness is the difference between a mid-level and senior answer.
- Know when NOT to scale. "Vertical scaling is sufficient at this load" shows better judgment than reflexively adding distributed components. Premature sharding is a well-documented anti-pattern.
- Scale one bottleneck at a time. Don't dump every tool at once. The interviewer wants to see you identify priorities and apply solutions incrementally.
- Be ready for the follow-up. After you scale reads, the interviewer will ask about writes. After you scale writes, they'll ask about failure modes. The 3-step framework applies to each follow-up independently.