๐Ÿ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Concepts

Scalability

Learn why systems break under load, how horizontal and vertical scaling work, and how to design for 10x traffic without a 3 a.m. outage.

31 min read · 2026-03-22 · medium · scalability · hld · concepts · system-design · fundamentals

TL;DR

  • Scalability is a system's ability to handle growing load by adding resources, without redesigning from scratch.
  • The two core levers are vertical scaling (bigger machines) and horizontal scaling (more machines). Horizontal wins long-term.
  • Stateless services are the prerequisite for horizontal scaling. Sessions must live in Redis, not in memory.
  • The DB is almost always the first bottleneck. Fix it with read replicas before reaching for sharding.
  • The fundamental trade-off: every layer you add multiplies your capacity but also multiplies your failure surface.

The Problem It Solves

Your app is fine at 1,000 users. Then your product gets posted on Hacker News at 9 a.m. on a Tuesday. By 9:05, you have 50,000 concurrent users.

The single server's CPU pegs at 100%. The request queue fills up. New connections get refused, users see spinning loaders, and your on-call phone rings.

And the worst part? The server isn't broken. It's doing exactly what you built it to do. You just built one of it.

The scaling blindspot

Most outages at fast-growing companies aren't caused by bugs. They're caused by architecture that was never designed to handle the load it's currently receiving. A system that worked at 10,000 users fails silently at 100,000.

[Diagram: a single app server receiving 10,000 concurrent requests — CPU at 100%, memory at 96%, request queue full, latency at 8.3 s, with 503 and timeout errors going back to users.]
One server can only go so far. When the queue fills up, every new request gets dropped or timed out — affecting every user at once.

Scalability is the discipline of designing systems so that this moment never happens. Or at least: so that when it does, you can fix it in 10 minutes instead of 10 hours. The mistake I see most often is teams treating it as a future problem — and by the time it's urgent, the path to fixing it cleanly has already closed.


What Is It?

Scalability is a system property — how gracefully a system can accommodate increased load as users, data, and requests grow. A system is scalable if adding resources produces a proportional gain in capacity. I find it helps to think of it less as a feature you build and more as a constraint you design around from the start.

Analogy: Think of a coffee shop on a busy morning. When the line gets long, you have two options: buy a faster espresso machine (vertical scaling), or open a second counter with another barista (horizontal scaling).

Option A has a physical limit — the shop only has so much floor space and the best machine only makes coffee so fast. Option B scales as long as you can hire people and find counter space. Modern system design is almost entirely option B.

[Diagram: vertical scaling shown as servers growing in size until they hit a hardware ceiling, next to horizontal scaling shown as a load balancer distributing to multiple identical servers, with room to add more.]
Vertical scaling hits a hardware ceiling. Horizontal scaling has no ceiling — just increasing operational complexity.

Vertical scaling is a short-term fix. Horizontal scaling is the architecture.


How It Works

There's no single "enable scalability" switch. You scale a system by identifying its bottleneck at each order of magnitude of traffic and applying the right lever. Here's what that looks like in practice:

| Traffic tier | Bottleneck | What you add |
|---|---|---|
| ~100 users | Nothing. Monolith is fine. | Single server. Ship it. |
| ~1,000 users | Single app server CPU | Load balancer + second server. Make sessions stateless (Redis). |
| ~10,000 users | Repeated DB reads | Cache layer (Redis). CDN for static assets. |
| ~100,000 users | DB read throughput | Read replicas. Long jobs go into a queue (Kafka, SQS). |
| ~1,000,000 users | DB write throughput, data volume | DB sharding or managed distributed DB. App tier auto-scales. |
| ~10,000,000 users | Regional latency, global coordination | Multi-region deployment. Global CDN. Eventual consistency. |

The key insight: you're not rewriting the app at each tier, you're adding infrastructure layers around it. The business logic stays the same. My recommendation in an interview: walk through this table explicitly and stop at the tier that matches the scale you're being asked to design for — don't jump straight to sharding.

The foundational change that unlocks horizontal scaling is making your services stateless. This means the server process itself holds no user-specific data in memory. Any instance can handle any request:

```javascript
// Bad: session stored in the server process memory.
// Only the server that created this session can serve this user.
app.use(session({ store: new MemoryStore() }));

// Good: session stored in Redis — any server instance can read it
// (express-session with a Redis-backed store such as connect-redis).
app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
}));
```

Once your app tier is stateless, a load balancer can send each request to any available instance. You can add instances when traffic spikes and remove them when it drops. When an instance crashes, requests route around it — statelessness is the one prerequisite that unlocks everything else in this section.
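The per-request routing a load balancer applies can be sketched in a few lines. This is an illustrative round-robin picker with health-check state, not a real load balancer; the class and method names are invented for the example:

```javascript
// Minimal sketch of round-robin selection over healthy instances only.
// Health checks run out-of-band; a failing instance is simply skipped.
class RoundRobinBalancer {
  constructor(hosts) {
    this.instances = hosts.map((host) => ({ host, healthy: true }));
    this.next = 0;
  }

  // A failed health check takes the instance out of rotation.
  markUnhealthy(host) {
    const inst = this.instances.find((i) => i.host === host);
    if (inst) inst.healthy = false;
  }

  // Pick the next healthy instance. Any instance can serve any request,
  // which is exactly why the app tier must be stateless.
  pick() {
    const healthy = this.instances.filter((i) => i.healthy);
    if (healthy.length === 0) throw new Error("no healthy instances");
    const chosen = healthy[this.next % healthy.length];
    this.next += 1;
    return chosen.host;
  }
}

const lb = new RoundRobinBalancer(["app-1", "app-2", "app-3"]);
lb.markUnhealthy("app-2"); // health check fails -> traffic routes around it
const picks = [lb.pick(), lb.pick(), lb.pick(), lb.pick()];
// picks alternates between app-1 and app-3; app-2 receives nothing
```

Real load balancers add connection draining, weighted routing, and SSL termination on top, but the core rotate-and-skip loop is this simple.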

```mermaid
flowchart TD
    subgraph Internet["🌐 Internet Layer"]
        Users(["👤 Users\n~1M+ concurrent"])
        CDN["🌐 CDN\nEdge-cached · < 10ms\nStatic: JS · CSS · Images · Video\nOffloads ~60–80% of all traffic"]
    end

    subgraph AppTier["⚙️ Stateless App Tier — Auto-Scaled"]
        LB["🔀 Load Balancer\nRound-robin · Health checks\nSSL termination · Sticky sessions off"]
        AS1["⚙️ App Server 1\nStateless · No local state\nAny instance handles any request"]
        AS2["⚙️ App Server 2\nStateless · No local state\nAuto-scaled on CPU / p95 latency"]
        AS3["⚙️ App Server N\nStateless · No local state\nNew instance boots in < 2 min"]
    end

    subgraph CacheTier["⚡ Cache Tier"]
        Redis["⚡ Redis Cluster\n< 1ms reads · ~90%+ hit rate\nSessions · Hot reads · Counters\nTarget: keeps DB read load flat"]
    end

    subgraph DBTier["🗄️ Database Tier"]
        DB[("🟢 Primary DB\nWrites only · ACID\nStrong consistency\nSingle source of truth")]
        R1[("🔵 Read Replica 1\nAll SELECT queries\nAsync replicated from primary")]
        R2[("🔵 Read Replica 2\nAll SELECT queries\nAsync replicated from primary")]
    end

    Users -->|"HTTPS"| CDN
    CDN -->|"Cache miss · dynamic requests"| LB
    LB -->|"Route to healthy instance"| AS1 & AS2 & AS3
    AS1 & AS2 & AS3 -->|"Cache reads / writes"| Redis
    AS1 & AS2 & AS3 -->|"Writes · mutations only"| DB
    Redis -->|"Cache miss → read replica"| R1 & R2
    DB -.->|"Async replication · ~10–50ms lag"| R1
    DB -.->|"Async replication · ~10–50ms lag"| R2
```

Key Components

| Component | Role in Scalability |
|---|---|
| Load Balancer | Distributes requests across app server instances. Enables horizontal scaling. Performs health checks and removes failing instances. |
| CDN | Serves static assets (JS, CSS, images, video) from edge nodes worldwide. Offloads a significant share of bandwidth from your origin servers. |
| Stateless App Tier | Each server handles any request independently. No local session data. Auto-scaling groups can add/remove instances in under 2 minutes. |
| Cache (Redis) | Serves hot reads from memory in under 1 ms. Keeps DB read load flat as traffic grows. Target hit rate above 90%. |
| Message Queue | Absorbs write bursts. Producers push events; consumers process at a controlled rate. Decouples spiky request volume from steady DB write throughput. |
| Read Replicas | Database copies that handle read traffic. Async-replicated from the primary. Buys 2–4x of read headroom before you ever need to consider sharding. |
| Database Sharding | Splits data across multiple DB instances by a shard key (e.g., user ID). Removes the per-node data and write-throughput ceiling. Last resort. |
| Auto-Scaling Group | Monitors CPU, latency, or queue-depth metrics and adds/removes app server instances automatically. Handles traffic spikes without manual intervention. |
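The cache's role is easiest to see as code. Below is a minimal sketch of the cache-aside read path, with a plain `Map` standing in for Redis and a stub function standing in for a read-replica query (both names are invented for illustration):

```javascript
// Cache-aside: check the cache first, fall back to the database on a miss,
// then populate the cache so the next reader hits it.
const cache = new Map();
let dbReads = 0; // counts how often we actually touch the database

// Stand-in for a SELECT against a read replica.
function fetchUserFromReplica(id) {
  dbReads += 1;
  return { id, name: `user-${id}` };
}

function getUser(id) {
  const key = `user:${id}`;
  if (cache.has(key)) return cache.get(key); // hit: sub-ms, no DB work
  const user = fetchUserFromReplica(id);     // miss: read through to the DB
  cache.set(key, user);                      // populate for the next reader
  return user;
}

getUser(42); // miss -> 1 DB read
getUser(42); // hit  -> still 1 DB read
getUser(42); // hit  -> still 1 DB read
```

A production version adds a TTL and an invalidation strategy on writes, but the flat-DB-load property comes entirely from this hit/miss split: at a ~90% hit rate, nine out of ten reads never reach the database.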

Types / Variations

Vertical vs. Horizontal Scaling

| | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Mechanism | More CPU/RAM on one machine | More machines |
| Ceiling | Hard hardware limit | Near-unlimited |
| Failure mode | Single point of failure | One instance down, others continue |
| Complexity | Simple — no code changes | Requires stateless design, coordination |
| Best for | Stateful DBs (short-term) | Stateless app tier, caches, queues |
| Cost curve | Steep — large machines cost disproportionately more | Linear — each instance costs the same |

Here's the thing most people miss: the table above doesn't tell you which to choose — it tells you which constraint you're working with. When you see a database that can't scale horizontally in an interview, that's a vertical-scaling or read-replica question, not a sharding question. Reach for sharding only when replicas and caching have been exhausted.

Reactive vs. Predictive Auto-scaling

Most cloud auto-scaling groups are reactive: they watch a metric (CPU, request latency, queue depth) and trigger when it crosses a threshold. The downside is warm-up time — adding a new instance takes 1–3 minutes, during which the existing instances absorb the load spike.

Predictive scaling uses ML to forecast traffic patterns and pre-warms instances before the spike hits. It earns its keep when you have known traffic events like product launches or sales.
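A reactive scaler's decision loop reduces to a threshold check plus a cooldown that covers the warm-up gap. Here's a sketch; the thresholds, cooldown, and function shape are illustrative values, not anyone's real policy:

```javascript
// Reactive scaling decision: compare a metric (CPU %) to thresholds and
// respect a cooldown so we don't flap while new instances warm up.
function decideScaling(state, cpuPercent, nowMs) {
  const COOLDOWN_MS = 2 * 60 * 1000; // covers the ~2 min instance boot time
  if (nowMs - state.lastActionMs < COOLDOWN_MS) {
    return { ...state, action: "wait" }; // let the last action take effect
  }
  if (cpuPercent > 70) {
    return { instances: state.instances + 1, lastActionMs: nowMs, action: "scale-out" };
  }
  if (cpuPercent < 30 && state.instances > 2) {
    return { instances: state.instances - 1, lastActionMs: nowMs, action: "scale-in" };
  }
  return { ...state, action: "hold" };
}

let state = { instances: 3, lastActionMs: 0 };
state = decideScaling(state, 85, 10 * 60 * 1000);              // hot -> scale out to 4
const duringWarmup = decideScaling(state, 95, 11 * 60 * 1000); // still warming -> wait
```

Note what the cooldown implies: during the warm-up window, the existing instances must absorb the spike alone. That gap is exactly what predictive scaling tries to eliminate.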

Interview tip: name your metric

When you mention auto-scaling in an interview, always say which metric you'd use. For compute-bound services: CPU utilization. For queue consumers: queue depth. For API gateways: request latency. "Auto-scaling based on load" is vague. "Auto-scaling on p95 latency exceeding 200ms" is specific and signals experience.

Geographic Scaling

Single-region architecture breaks when users are globally distributed. A server in us-east-1 adds 200ms of base latency for a user in Tokyo. Here's the order of operations, from least to most drastic:

  1. CDN for static assets first.
  2. Read replicas in each region for low-latency reads.
  3. GeoDNS routing to send each user to the nearest region.
  4. Multi-region active-active for Tier 1 systems — every region accepts writes, replication runs bi-directionally.

Multi-region active-active means you're now dealing with conflict resolution, network partitions, and clock skew. Reach for it only when regional latency is a confirmed user-facing problem. Most systems that adopt geographic scaling never need to go past step 2.
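Step 3's GeoDNS routing amounts to a static mapping from user geography to the nearest deployment region, resolved at the DNS layer in practice. A toy sketch (the region names and mappings here are illustrative):

```javascript
// GeoDNS in miniature: pick the nearest region from a lookup table,
// with a default region as the fallback for unknown geographies.
const NEAREST_REGION = {
  "north-america": "us-east-1",
  "europe": "eu-west-1",
  "asia-pacific": "ap-northeast-1",
};
const DEFAULT_REGION = "us-east-1";

function routeUser(userGeo) {
  // Unknown or unmapped geographies fall back rather than failing.
  return NEAREST_REGION[userGeo] ?? DEFAULT_REGION;
}

routeUser("asia-pacific"); // Tokyo user -> ap-northeast-1, not a 200ms trip to us-east-1
```

The point of the fallback line: routing must degrade to "slower but working", never to an error.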


Database Scaling Deep Dive

The database is almost always the first serious bottleneck. App servers are stateless and cheap to add. Databases are stateful and hard to distribute.

[Diagram: two database scaling patterns — read replicas, with a primary handling writes and three async-replicated replicas handling reads, and sharding, with a shard router distributing writes across three shards by hash(userID).]
Start with read replicas. Sharding is reserved for write bottlenecks and data volume limits — not your first move.

Read replicas work because most applications are heavily read-skewed (often 90%+ reads). You add one or more async-replicated copies of the primary and route all SELECT queries to them. The primary handles only writes.

This is often enough to get you from 100,000 to 1,000,000 users without touching your schema. I'll often see candidates jump straight to sharding when the answer is actually just a read replica — it's a far simpler operation and reversible if you misconfigure it.

Replication lag is real

Read replicas are eventually consistent. Changes written to the primary won't appear on replicas for a few milliseconds, or longer under heavy load. For reads where freshness matters — reading your own writes, payment confirmations — always route to the primary.
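Routing logic that respects this rule looks roughly like the sketch below: writes and freshness-critical reads go to the primary, everything else to a replica. The stand-in connection objects and the `mustBeFresh` flag are invented for illustration; in practice these would be two connection pools:

```javascript
// Read/write splitting with a read-your-own-writes escape hatch.
const primary = { name: "primary", queries: [] };
const replica = { name: "replica", queries: [] };

function routeQuery(sql, { mustBeFresh = false } = {}) {
  const isWrite = /^\s*(INSERT|UPDATE|DELETE)/i.test(sql);
  // Writes always hit the primary; freshness-critical reads do too,
  // because the replica may lag behind by the replication delay.
  const target = isWrite || mustBeFresh ? primary : replica;
  target.queries.push(sql);
  return target.name;
}

routeQuery("SELECT * FROM posts");                        // -> replica
routeQuery("UPDATE users SET plan = 'pro' WHERE id = 7"); // -> primary
routeQuery("SELECT plan FROM users WHERE id = 7",
           { mustBeFresh: true });                        // -> primary (read-your-own-writes)
```

The hard part in a real system isn't the routing, it's knowing which reads need the flag: anything the user just wrote and is about to see again.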

Sharding splits the dataset itself across multiple independent database instances. Each shard owns a subset of the data (typically partitioned by a hash of the primary key). Cross-shard queries (joining data owned by different shards) are expensive and often require denormalization.

Use sharding when:

  • Your dataset is too large to fit on a single machine's storage.
  • Your write throughput has outpaced what a single primary can handle.
  • You've already exhausted read replicas, caching, and query optimization.

The order matters: cache first, read replicas second, sharding last. Every time I've seen a team skip this progression, they've ended up doing expensive migration work when replicas alone would have bought them another 18 months.
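When you do reach sharding, the routing itself is simple: a stable hash of the shard key modulo the shard count. A sketch, using FNV-1a as an illustrative hash (real systems often prefer consistent hashing so that adding a shard doesn't remap every key):

```javascript
// FNV-1a: a small, stable string hash. Illustrative choice, not a prescription.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // 32-bit multiply without precision loss
  }
  return h;
}

const NUM_SHARDS = 4;

// The shard key (user ID) fully determines which DB instance owns the row.
function shardFor(userId) {
  return fnv1a(String(userId)) % NUM_SHARDS;
}

// The same user always lands on the same shard -- the property that makes
// single-user queries cheap and cross-user joins expensive.
const a = shardFor("user-123");
const b = shardFor("user-123");
```

This determinism is also why resharding is painful: changing `NUM_SHARDS` in a naive modulo scheme remaps most keys, which means moving most of your data.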


Trade-offs

| Pros | Cons |
|---|---|
| Handles traffic spikes without redesign | Distributed systems are harder to debug and reason about |
| Enables zero-downtime rolling deployments | Stateless mandate requires session infrastructure that must itself be HA |
| Fault tolerant — one instance down, others serve traffic | DB sharding introduces cross-shard query complexity |
| Cost-efficient — scale in when traffic drops | More components mean more potential failure points |
| Each layer scales independently | Eventual consistency means some reads may see stale data |

The fundamental tension here is capacity vs. complexity. Every infrastructure layer you add multiplies your headroom but compounds the number of things that can go wrong at 3 a.m. I always ask one question before recommending a new layer: "Are you solving a problem you currently have, or one you imagine you'll have?"

The goal is to add exactly enough complexity for your current traffic tier. Adding more than that is just technical debt with a different name.

The premature scaling trap

Netflix-scale architecture at 100 users is a liability, not an asset. It's more complex to debug, slower to iterate on, and more expensive to run. Design for 10x your current load, not 1000x.


When to Use It / When to Avoid It

So when does this actually matter? Almost every design question involving millions of users will touch scaling. The decision logic below is how I work through it โ€” start with the simplest fix and escalate only when that's confirmed insufficient.

Scale horizontally when:

  • Your service is stateless, or can be made stateless with minimal effort.
  • You need fault tolerance as well as throughput.
  • Traffic is unpredictable or follows spiky patterns (consumer apps, marketing events).
  • You're running in a cloud environment with auto-scaling available.

Scale vertically when:

  • You have a stateful component with tight consistency requirements (single-primary DB).
  • The data or compute lives on a single node and distributing it adds more complexity than it solves.
  • It's a short-term fix while you architect the horizontal migration.

Avoid over-engineering when:

  • You're under 10,000 concurrent users and a monolith serves everyone fine.
  • You haven't profiled and confirmed the bottleneck is compute. Slow queries, N+1 patterns, and missing indexes cause more outages than insufficient server count.
  • The team doesn't yet have the operational maturity to run distributed infrastructure reliably.

If you're not sure whether you need to scale, you probably don't yet.

Profile before you scale

Before adding servers, run EXPLAIN ANALYZE on your slow queries. Add the missing index. Fix the N+1. In many cases, a query optimization delivers more headroom than doubling your server count and takes 20 minutes instead of 3 days.
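Here's the N+1 fix in miniature, with a stub query counter standing in for a real database driver (the SQL strings and stub are illustrative):

```javascript
// Count round trips to a pretend database.
let queriesIssued = 0;
function queryDb(sql) { queriesIssued += 1; return []; }

const postIds = [1, 2, 3, 4, 5];

// N+1: one query for the list, then one per item -- 6 round trips.
function loadCommentsNPlusOne() {
  queryDb("SELECT id FROM posts LIMIT 5");
  for (const id of postIds) {
    queryDb(`SELECT * FROM comments WHERE post_id = ${id}`);
  }
}

// Batched: two queries total, regardless of list size.
function loadCommentsBatched() {
  queryDb("SELECT id FROM posts LIMIT 5");
  queryDb(`SELECT * FROM comments WHERE post_id IN (${postIds.join(",")})`);
}

loadCommentsNPlusOne();
const nPlusOneCount = queriesIssued; // 6 round trips
queriesIssued = 0;
loadCommentsBatched();
const batchedCount = queriesIssued;  // 2 round trips
```

At 5 items the difference is 6 vs. 2; at 500 items it's 501 vs. 2. That's the kind of headroom no amount of added servers can match.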


Real-World Examples

Netflix runs thousands of stateless microservices on AWS, auto-scaled across multiple regions. At peak (~8 p.m. on a weekday), they stream to a subscriber base of over 250 million. Statelessness is non-negotiable — every service team is required to design for zero local state so that auto-scaling works without coordination.

Instagram grew to 1 billion users with a surprisingly small engineering team by staying disciplined about not scaling prematurely. They ran a Python/Django monolith and scaled PostgreSQL with read replicas for years before considering sharding. They disaggregated services gradually, not all at once.

Discord went from 8 servers to 850,000 concurrent users in under two years. Their biggest scaling win came not from adding more app servers, but from migrating hot gaming presence data from Cassandra to ScyllaDB. The lesson is one I return to constantly: the bottleneck is almost never where you assume.

Profile first — then scale what's actually bottlenecked.


How This Shows Up in Interviews

In most design interviews, you're not being tested on whether you know what a read replica is — you're being tested on whether you understand when it's the right move and why. The difference, in my experience, comes down to sequencing: name the bottleneck first, then walk through the scaling ladder in order, then address consistency.

When to bring it up

In any design question involving millions of users, proactively explain your scaling strategy within the first 5 minutes. Don't wait to be asked. A sentence like "I'll make the app tier stateless so we can auto-scale it, and add a cache layer to keep DB load manageable" signals immediately that you understand the problem space.

Depth expected at senior and staff level:

  • Identify the bottleneck correctly. Is it compute? Read throughput? Write throughput? Data volume? Different answers require different solutions.
  • Don't jump to sharding. The correct progression is: caching first, read replicas second, sharding as a last resort.
  • Know what your auto-scaling metric should be — and why. This varies by workload type.
  • Address the warm-up gap: when a new instance starts, it needs time to initialize. What happens to requests during that window?
  • Discuss consistency. Once you have replicas or caches, your reads may be stale. Know where that's acceptable and where it isn't.

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
|---|---|
| "Your app server is at 80% CPU. What do you do?" | "First, confirm it's compute-bound and not a slow query. If it's compute, add a load balancer and a second stateless instance. Move sessions to Redis first so any server can handle any request." |
| "The DB is your bottleneck at 100k users. What's step one?" | "Add a read replica. Most workloads are 90% reads. Route SELECTs to the replica, INSERTs and UPDATEs to the primary. That alone usually buys 2–4x of headroom." |
| "How do you scale a write-heavy service?" | "Queue the writes. Producers push to Kafka or SQS. Consumers write to the DB at a controlled rate. You decouple the incoming request spike from the DB's write throughput." |
| "What metric should the auto-scaling group use?" | "Whatever your actual bottleneck is. CPU for compute-bound services. Queue depth for consumers. p95 request latency for latency-sensitive APIs. There's no universal answer." |
| "How do you design for a 10x traffic spike?" | "Stateless app tier plus auto-scaling. Pre-warm the cache before a known event. Put writes behind a queue. Have a runbook for manually bumping the replica count if the primary shows strain." |

The throughline in every strong answer: name the bottleneck explicitly, follow the scaling ladder in order, and address the consistency implications of what you add.


Quick Recap

  1. Scalability is the ability to handle growing load by adding resources without a redesign. The goal is proportional gains in capacity from proportional additions of resources.
  2. Vertical scaling (bigger machines) hits a hard ceiling and creates a single point of failure. Horizontal scaling (more machines) has no ceiling but requires stateless design.
  3. Stateless services are the prerequisite for horizontal scaling. Sessions, user context, and ephemeral state must live in an external store like Redis, not in the process itself.
  4. The database is almost always the first serious bottleneck. Add a read replica before reaching for sharding. Add caching before adding read replicas.
  5. Sharding scales write throughput and data volume but introduces cross-shard query complexity and is expensive to undo. Treat it as a last resort.
  6. Auto-scaling handles traffic spikes automatically, but pick the right metric (CPU, queue depth, latency) and account for the instance warm-up gap.
  7. In interviews, name the bottleneck explicitly, follow the scaling ladder in order (cache first, replicas second, sharding last), and always address the consistency implications of what you add.

Related Concepts

  • Load Balancing — The mechanism that makes horizontal scaling of the app tier actually work. Understanding health checks and routing algorithms is essential.
  • Caching — The fastest way to reduce load on your DB without adding any new servers. Cache hit rate determines how much work your database actually has to do.
  • Sharding — The deep dive on splitting data across multiple database instances, including how to choose a shard key and what resharding costs.
  • Databases — Covers the read replica setup in detail, including consistency guarantees, replication lag, and when to route reads to the primary.
  • Message Queues — The tool for scaling write-heavy workloads by decoupling producers from consumers and absorbing traffic bursts.
