System design crash course
Everything you need to design large-scale systems — trade-offs, CAP, databases, caching, queues, and communication protocols — in one place.
TL;DR
- System design interviews are open-ended conversations. You lead. Clarify scope, then iterate from simple to scaled.
- Every architectural decision is a trade-off. There are no objectively correct answers, only contextually better ones.
- The foundational tensions: performance vs. scalability, latency vs. throughput, availability vs. consistency.
- CAP theorem says that when a partition happens, a distributed system must choose between consistency and availability. Partition tolerance is not optional.
- The bottleneck always moves. Fix the DB first, then add caches, then replicas, then shards. Don't over-engineer day one.
How to Approach a System Design Interview
Most candidates lose system design interviews not because they don't know the concepts, but because they jump to solutions before understanding the problem. The interviewer is watching how you think, not just what you know.
Here's the loop that works:
Step 1: Clarify the problem. Ask about use cases, scale, and constraints. "Do we need real-time reads or is eventual consistency acceptable?" Don't spend more than 5 minutes here.
Step 2: Estimate scale. Back-of-the-envelope numbers anchor the design. How many requests per second? How much storage over five years? These numbers determine whether you need sharding, replication, or a queue.
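A back-of-the-envelope estimate can be sketched in a few lines. All the input numbers below (users, requests per user, bytes per day) are illustrative assumptions for a hypothetical service, not real requirements:

```typescript
// Back-of-the-envelope sizing for a hypothetical photo-sharing service.
// Every input number here is an illustrative assumption.
const dailyActiveUsers = 10_000_000;
const requestsPerUserPerDay = 20;
const secondsPerDay = 86_400;

// Average requests per second, plus a peak factor for traffic spikes.
const avgRps = (dailyActiveUsers * requestsPerUserPerDay) / secondsPerDay;
const peakRps = avgRps * 3; // assume peak traffic is ~3x the average

// Storage over five years, assuming 1 MB of new data per user per day.
const bytesPerUserPerDay = 1_000_000;
const fiveYearBytes = dailyActiveUsers * bytesPerUserPerDay * 365 * 5;

console.log(`~${Math.round(avgRps)} avg RPS, ~${Math.round(peakRps)} peak RPS`);
console.log(`~${(fiveYearBytes / 1e15).toFixed(1)} PB over five years`);
```

Numbers like these (a few thousand RPS, double-digit petabytes) are what tell you whether a single database is plausible or whether sharding is on the table from day one.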
Step 3: Design the core components. Start simple. One server, one database. Then identify the bottlenecks. Don't pre-optimize.
Step 4: Scale iteratively. Add load balancers when you have multiple app servers. Add caches when reads dominate. Add replicas when your DB is the bottleneck. Add shards when a single DB can't hold the data.
Clarify → Estimate → Design (simple) → Identify Bottleneck → Scale → Repeat
Say this in your interview
"I'll start with a simple design and then identify bottlenecks to scale from there." This one sentence tells the interviewer you think iteratively and won't over-engineer from the start.
The most common failure mode: spending 20 minutes designing a globally distributed system when the requirements say "1 million users." Match your design to the scale you're given.
The Three Core Trade-offs
Before picking any component, you need these three trade-offs in your head. They come up in every system design question, whether you're asked about them explicitly or not.
Performance vs. Scalability
A system has a performance problem when it's slow for a single user. It has a scalability problem when it's fast for one user but degrades under load. These are different failure modes requiring different fixes.
Optimizing a query fixes performance. Adding a cache layer fixes scalability. Know which problem you actually have before reaching for a solution.
Latency vs. Throughput
Latency is the time it takes for one operation to complete. Throughput is how many operations you can complete per unit of time. You generally want to maximize throughput while keeping latency under an acceptable ceiling.
The tension: batching increases throughput but adds latency per operation. Real-time systems need low latency. Data pipelines need high throughput. Pick the right trade for your workload.
Availability vs. Consistency
This is the tension that everything downstream flows from. In a distributed system, when the network partitions (and it will), you have to choose: do you return potentially stale data (availability), or do you return an error (consistency)?
Neither choice is always wrong. Financial systems need consistency. Social feeds can tolerate staleness.
CAP Theorem
In any distributed system, network partitions are unavoidable. So you don't actually choose between all three; in practice you choose between CP and AP when a partition occurs.
CP (Consistency + Partition Tolerance): The system stays consistent by refusing to serve requests that might return stale data. Banks, booking systems, anything where double-booking is catastrophic. The downside is that during a partition, you might get a timeout instead of an answer.
AP (Availability + Partition Tolerance): The system always responds, but the data might not be the latest version. DNS, shopping carts, social feeds. The upside is uninterrupted service. The downside is that two nodes might disagree for a window of time.
The common CAP misconception
CAP doesn't say you permanently sacrifice one property. It says that during a partition, you have to pick which one degrades. Most modern "AP" systems do eventually become consistent once the partition heals.
Networks aren't reliable, so you'll always need partition tolerance. The real decision is what you sacrifice during the window when the network breaks.
Consistency Patterns
Once you've decided on CP vs. AP, you need to pick a consistency model for how data replicates across nodes.
| Model | Behavior | Good for |
|---|---|---|
| Weak consistency | After a write, reads may or may not see it. Best-effort. | VoIP, video streaming, multiplayer games |
| Eventual consistency | After a write, reads will eventually converge to it (typically milliseconds). | DNS, email, social feeds, shopping carts |
| Strong consistency | After a write, all subsequent reads reflect it immediately. | Banking, inventory, booking systems |
The trade-off is clear. Strong consistency is slower (every write waits for replication confirmation). Eventual consistency is faster but requires your application to tolerate briefly stale reads.
Availability Patterns
High availability is achieved through redundancy. Two patterns dominate:
Fail-over
Active-passive: One server handles all traffic. A second server sits idle, receiving heartbeats. When the heartbeat stops, the passive server claims the active's IP and takes over. Downtime depends on whether the passive is "warm" (already running) or "cold" (needs to boot).
Active-active: Both servers handle traffic simultaneously. The load balancer sends traffic to both. If one fails, the other absorbs its share. No cold-start delay. This is what most well-designed systems use.
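The active-passive promotion logic can be sketched as a small heartbeat monitor. The timeout value and the injected clock are assumptions for illustration; a real implementation would also claim the shared IP and guard against split-brain:

```typescript
// Minimal sketch of active-passive failover driven by heartbeats.
// Timing values are illustrative; real systems also need split-brain protection.
type Role = "active" | "passive";

class FailoverMonitor {
  private lastHeartbeat: number;
  role: Role = "passive";

  constructor(
    private readonly timeoutMs: number,
    private readonly now: () => number = Date.now, // injectable clock for testing
  ) {
    this.lastHeartbeat = this.now();
  }

  // Called each time the passive node hears from the active node.
  recordHeartbeat(): void {
    this.lastHeartbeat = this.now();
  }

  // Called on a timer: promote ourselves if the active node has gone silent.
  check(): Role {
    if (this.role === "passive" && this.now() - this.lastHeartbeat > this.timeoutMs) {
      this.role = "active"; // in a real system: claim the shared IP here
    }
    return this.role;
  }
}
```

The "warm" vs. "cold" distinction from above lives outside this sketch: a warm passive can start serving the moment `check()` flips to active, while a cold one still has to boot.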
Availability in Numbers
Availability is measured in "nines." Two nines (99%) sounds fine until you realize it means 87 hours of downtime per year.
| SLA | Downtime per year | Downtime per month |
|---|---|---|
| 99% (two 9s) | 87.6 hours | 7.3 hours |
| 99.9% (three 9s) | 8.7 hours | 43.8 minutes |
| 99.99% (four 9s) | 52.6 minutes | 4.4 minutes |
| 99.999% (five 9s) | 5.3 minutes | 26.3 seconds |
Components in sequence multiply failure risk. Two components each at 99.9% availability, placed in sequence, give you 99.8% combined. Components in parallel improve availability. Those same two components in parallel give you 99.9999%.
This is why you put databases in replication groups and load balancers in pairs.
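The sequence and parallel rules above are just two small formulas, sketched here for clarity:

```typescript
// Combined availability of components, per the rules above:
// in sequence, availabilities multiply; in parallel, the failure
// probabilities multiply and you subtract the product from 1.
function inSequence(...availabilities: number[]): number {
  return availabilities.reduce((acc, a) => acc * a, 1);
}

function inParallel(...availabilities: number[]): number {
  const combinedFailure = availabilities.reduce((acc, a) => acc * (1 - a), 1);
  return 1 - combinedFailure;
}

console.log(inSequence(0.999, 0.999)); // ≈ 0.998 (99.8%: worse than either alone)
console.log(inParallel(0.999, 0.999)); // ≈ 0.999999 (six 9s: better than either alone)
```

Run the numbers before committing to an SLA: a chain of five 99.9% services in sequence is already down to about 99.5%.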
Domain Name System (DNS)
DNS translates human-readable domain names (like api.example.com) into IP addresses. It's hierarchical: authoritative servers at the top, recursive resolvers in the middle, and TTL-controlled caches at the browser and OS level.
Key DNS routing strategies:
- Weighted round-robin: Split traffic across multiple IPs in configurable proportions. Useful for A/B testing or gradual rollouts.
- Latency-based routing: Send each user to the server with the lowest round-trip time. AWS Route 53 does this automatically.
- Geolocation routing: Route European users to EU servers, US users to US servers. Reduces latency and helps with data residency compliance.
The main downside: DNS changes propagate slowly due to TTLs. If you need instant failover, keep TTLs short (60–300 seconds) before any planned changes.
Content Delivery Network (CDN)
A CDN is a globally distributed network of edge servers that cache and serve content from locations physically close to users. Without a CDN, every request for a static asset travels back to your origin server — potentially crossing an ocean.
Two delivery models:
Push CDN: You upload content to the CDN proactively. Content is immediately available everywhere. Best for sites with infrequent updates and moderate traffic (push once, serve forever). Downside: storage costs scale with your content library.
Pull CDN: Content stays on your origin. The CDN fetches it on first request and caches it until TTL expires. Best for high-traffic sites where popular content gets requested repeatedly. Downside: the first user for each asset waits for the origin round-trip.
Rule of thumb for CDN choice
Low traffic or infrequent updates: use Push. High traffic with frequently accessed assets: use Pull — the cache hit rate will be high and your origin won't feel the load.
CDNs also protect your origin from traffic spikes and DDoS by absorbing requests at the edge before they reach your infrastructure.
Load Balancer
A load balancer sits between your clients and your backend servers, distributing incoming traffic so no single server bears the full load. Without one, you can't scale horizontally.
Routing algorithms:
| Algorithm | How it works | Best for |
|---|---|---|
| Round-robin | Rotate through servers in order | Identical servers, uniform request cost |
| Weighted round-robin | Rotate, but heavier servers get more turns | Heterogeneous server fleet |
| Least connections | Route to the server with fewest active connections | Long-lived connections (WebSockets) |
| Session/sticky | Route the same user to the same server | Apps with server-side session state (avoid this) |
| Layer 4 | Route by IP/port only, no content inspection | Maximum throughput, minimal overhead |
| Layer 7 | Route by URL path, headers, content | Advanced routing (e.g., /api/* to one service) |
Layer 4 vs Layer 7: L4 load balancers operate at the transport layer — they look at IPs and ports, make routing decisions, and forward packets without reading the request body. They're fast but "dumb." L7 load balancers read the HTTP request. They can route /videos to video servers and /checkout to payment servers. More expensive computationally, but far more flexible.
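To make the weighted round-robin row concrete, here is a naive sketch. The server names and weights are made up, and the expansion approach is the simplest possible one; production balancers like NGINX use a smoother interleaving (smooth weighted round-robin) so a heavy server's turns are spread out:

```typescript
// Naive weighted round-robin sketch: each backend appears in the
// rotation in proportion to its weight. Names and weights are made up.
interface Backend {
  host: string;
  weight: number; // relative capacity
}

function buildRotation(backends: Backend[]): string[] {
  // Expand each backend `weight` times into a flat rotation list.
  return backends.flatMap((b) => Array(b.weight).fill(b.host));
}

const rotation = buildRotation([
  { host: "big-server", weight: 3 },   // gets 3 of every 4 requests
  { host: "small-server", weight: 1 }, // gets 1 of every 4 requests
]);

// Pick the next backend by cycling through the rotation.
function pick(requestCounter: number): string {
  return rotation[requestCounter % rotation.length];
}
```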
The load balancer is a SPOF
A single load balancer is itself a single point of failure. Production systems run two LBs in active-passive or active-active mode. AWS ALB and GCP Cloud Load Balancing handle this for you automatically.
Load balancers also handle SSL termination (so your backend servers don't need to manage certs) and health checks (so traffic never goes to a server that's failing).
Reverse Proxy
A reverse proxy is a server that sits in front of your app servers and forwards client requests to them. From the client's perspective, it looks like one server. From your infrastructure's perspective, it's a flexible traffic router.
The distinction from a load balancer: a load balancer distributes traffic across multiple identical servers, while a reverse proxy often sits in front of a single backend or a heterogeneous set of services.
Reverse proxies give you:
- Security: Hide internal server details. Clients never talk directly to your app. Blacklist IPs at one point.
- SSL termination: Handle HTTPS at the proxy, talk HTTP internally.
- Compression: Gzip responses before sending them to clients.
- Caching: Cache responses for frequently requested resources.
- Static content: Serve HTML, CSS, and JS directly without hitting app servers.
NGINX and Envoy are the most common. In Kubernetes environments, an ingress controller plays this role.
Databases
Your choice of database sets the ceiling for everything else. Get it wrong and no amount of caching will save you.
Relational Databases (RDBMS)
Relational databases organize data into tables with rows and columns. They enforce schemas, provide ACID transactions, and support complex joins. Postgres and MySQL dominate. When you need strict consistency and your data has relationships you need to query across, start here.
ACID guarantees:
- Atomicity: All operations in a transaction succeed or all roll back.
- Consistency: Transactions bring the DB from one valid state to another.
- Isolation: Concurrent transactions don't interfere.
- Durability: Committed writes survive crashes.
Scaling Patterns for Relational DBs
Master-slave replication: One primary handles writes. One or more replicas receive async copies and handle reads. Great for read-heavy workloads (most apps are 80%+ reads). The downside: if the primary fails, a replica must be promoted, and you might lose recent writes.
Master-master replication: Multiple nodes accept writes. Conflict resolution is complex. Useful when you need write availability in multiple regions. Most teams avoid this unless they specifically need it.
Federation (functional partitioning): Split your monolithic database into multiple databases by domain. The users table goes to the users database; the orders table goes to the orders database. Less write contention per database, and each one can fit more data in memory for faster access. Downside: cross-database joins require application-level code.
Sharding: Split a single large table across multiple database nodes by a shard key (e.g., user ID modulo number of shards). Each shard holds a subset of rows. This eliminates the write bottleneck of a single primary. Downside: choosing the wrong shard key causes hotspots; resharding when you add nodes is painful.
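Shard routing itself is simple, which is part of why the shard-key choice matters so much. A minimal sketch, with an illustrative hash function and shard count (real systems often use stronger hashes or consistent hashing):

```typescript
// Sketch: route a row to a shard by hashing its shard key.
// FNV-1a-style hash for illustration; production systems use stronger hashes.
function hashString(s: string): number {
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0; // force unsigned 32-bit
}

function shardFor(userId: string, shardCount: number): number {
  return hashString(userId) % shardCount;
}

// Every read and write for this user goes to the same shard.
const shard = shardFor("user-42", 4);
```

Note what happens when you grow from 4 shards to 5: `hash % shardCount` changes for nearly every key, forcing a mass data migration. That resharding pain is exactly why consistent hashing exists.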
Denormalization: Pre-compute joins by storing duplicate data across tables. Reads become fast without complex queries. Writes become more complex because you must keep duplicates in sync. Use when your read:write ratio is extremely high (100:1+).
SQL tuning essentials: Add indexes on query predicates. Use EXPLAIN ANALYZE to see what the planner actually does. Avoid SELECT * — fetch only the columns you need. Use CHAR for fixed-length fields and VARCHAR where lengths vary. Avoid large BLOBs in relational tables — store the reference, not the object.
NoSQL Databases
NoSQL covers four fundamentally different data models. They trade SQL's expressiveness for different scaling profiles.
Key-value store: The simplest model. A distributed hash table. O(1) reads and writes. Redis is the go-to. Everything from caching to session storage to rate limit counters. The downside: limited query expressiveness — you can only fetch by key.
Document store: Stores JSON (or JSON-like) documents. Each document is self-describing and can have different fields. MongoDB is the canonical example. Great when your data schema evolves frequently, or when you're working with hierarchical data. Downside: joins across documents require application code.
Wide-column store: Stores data as rows but with flexible columns per row. Think of it as a Map of Maps: ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>>. Cassandra and HBase are the standard choices. Optimized for massive write throughput and time-series data. The column family and row key design is critical — get it wrong and you'll get hotspot nodes.
Graph database: Stores data as nodes (entities) and edges (relationships). Optimized for graph traversals — "find all friends of friends who bought this product." Neo4j is the most common. Powerful when your data is fundamentally relational in a graph-theoretic sense, but most systems don't need it.
SQL vs. NoSQL Decision
Choose SQL when: data is structured, you need transactions, you have complex join requirements, and your team knows it well. Choose NoSQL when: schema is dynamic or unpredictable, you need to store petabytes, your workload is write-heavy at very high throughput, or you're building a cache layer.
Application Layer and Microservices
The application layer is where your business logic lives. Separating it from the web layer (the ingress point that handles HTTP) lets you scale them independently.
Service discovery is the mechanism by which services find each other. Two patterns:
- Client-side discovery: Each service queries a registry (like Consul) directly and load-balances calls itself.
- Server-side discovery: A router/proxy queries the registry and handles load balancing transparently (e.g., AWS ALB + ECS).
Breaking a monolith into microservices lets each team own a service and deploy independently. But it also introduces distributed systems complexity: network calls fail, services have independent failure modes, and data consistency across service boundaries is hard. Don't split unless you're feeling the pain of a single codebase.
Caching
Caching is the single highest-leverage tool in a system designer's toolkit. A cache hit costs microseconds. A database query costs milliseconds. That's a 1000x difference.
Where to Cache
Multiple layers can coexist. Browser caches reduce repeat requests. CDN caches reduce origin load. Application-level Redis caches are the workhorse for backend caching.
Caching at the query level: Hash a DB query string and store the result. Simple to implement. Fragile — if any cell in the result changes, you need to invalidate all cached queries that might have included that data.
Caching at the object level: Cache entire domain objects (e.g., a User object). Easier to reason about. Invalidation is simpler — when the user changes, evict the user cache key.
Cache Update Strategies
This is where most interviews go deep. Know all four patterns.
| Strategy | How it works | Pro | Con |
|---|---|---|---|
| Cache-aside | App checks cache first, loads from DB on miss, then writes to cache | Only caches what's actually requested | Cache miss = 3 trips (check cache, read DB, write cache) |
| Write-through | App writes to cache; cache synchronously writes to DB | Cache always fresh | Every write is slower; cache fills with cold data |
| Write-behind | App writes to cache; cache writes to DB asynchronously | Fastest writes | Data loss risk if cache crashes before async flush |
| Refresh-ahead | Cache predicts what you'll need and pre-loads before TTL expires | Lowest read latency | Prediction errors waste cache capacity |
// Cache-aside pattern example
async function getUser(userId: string) {
  // 1. Check cache
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);
  // 2. Cache miss — fetch from DB (rows[0] with node-postgres-style clients)
  const { rows } = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  const user = rows[0];
  if (!user) return null; // don't cache the miss (or add negative caching deliberately)
  // 3. Populate cache with a 1-hour TTL
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));
  return user;
}
Cache invalidation is the hard part
Phil Karlton's famous quote: "There are only two hard things in computer science: cache invalidation and naming things." When your source-of-truth changes, how does the cache learn about it? TTL-based expiry is the simplest. Event-driven invalidation is precise but complex to implement correctly.
Cache-aside is the default starting point for most systems. Use write-through when you can't tolerate stale cache entries. Use write-behind only when write throughput is the primary bottleneck and you've explicitly accepted the risk of data loss.
Asynchronism
Synchronous calls make the user wait while your system does work. Asynchronism offloads work so the user gets an immediate acknowledgment and the heavy processing happens in the background.
Message Queues
Message queues decouple producers from consumers. A producer publishes a message. The queue holds it. A consumer reads it when ready. If the consumer is slow, messages accumulate rather than cascading failures back to the producer.
Common choices: Redis (simple, but messages can be lost on crash), RabbitMQ (AMQP protocol, reliable, requires self-management), Amazon SQS (managed, at-least-once delivery, slight added latency), Apache Kafka (high throughput, ordered within partitions, replay-capable).
Task Queues
Task queues are a type of message queue focused on job execution. A task represents a unit of work (resize an image, send an email, recalculate a recommendation). Workers pick them up and execute them. Celery (Python) and BullMQ (Node.js) are popular.
Back Pressure
Back pressure is the signal a consumer sends to a producer when it can't keep up. Without back pressure, a fast producer overwhelms a slow consumer and the queue grows until memory runs out.
Three ways to apply back pressure:
- Reject: When the queue is full, the producer gets an error and retries later.
- Block: The producer is told to wait (useful in streaming pipelines).
- Drop: The system discards new messages (only acceptable for non-critical data like metrics sampling).
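The "Reject" strategy above can be sketched as a bounded in-memory queue that refuses new items when full, pushing the retry decision back onto the producer:

```typescript
// Sketch of back pressure via rejection: a bounded FIFO queue that
// returns false to producers instead of growing without limit.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly capacity: number) {}

  // Returns false when full: the producer is expected to retry with
  // backoff, shed the work, or route it to a lower-priority path.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

Real brokers expose the same idea differently: RabbitMQ has queue length limits with overflow behavior, and Kafka applies back pressure through producer buffer limits and blocking sends.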
In your interview
When you add a queue to your design, the interviewer will ask "what happens if the queue fills up?" That's the back pressure question. Have an answer ready: bounded queue with rejection, consumer scaling trigger, or dropping low-priority messages.
Communication Protocols
How services talk to each other is as important as what they say. The choice of protocol affects performance, debuggability, and coupling.
HTTP
HTTP is the application-level protocol underlying the entire web. A request has a verb and a resource; a response has a status code and a body. Stateless by design (each request carries all the context it needs).
| Verb | Action | Idempotent? | Safe? |
|---|---|---|---|
| GET | Fetch a resource | Yes | Yes |
| POST | Create a resource or trigger an action | No | No |
| PUT | Replace a resource entirely | Yes | No |
| PATCH | Partially update a resource | No | No |
| DELETE | Remove a resource | Yes | No |
Idempotent means calling it multiple times produces the same result as calling it once. This matters for retry logic.
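The retry implication can be made concrete: a client may safely re-send idempotent requests after a transient failure, but retrying a POST risks duplicating the side effect. A sketch, where `send` is an injected stand-in for a real HTTP client:

```typescript
// Sketch: retry only idempotent HTTP verbs on transient failure.
// `send` is an injected assumption standing in for a real HTTP client;
// it resolves to a status code or throws on a network error.
const IDEMPOTENT = new Set(["GET", "PUT", "DELETE", "HEAD"]);

async function requestWithRetry(
  method: string,
  url: string,
  send: (method: string, url: string) => Promise<number>,
  maxAttempts = 3,
): Promise<number> {
  // Non-idempotent verbs get exactly one attempt: a retry could
  // create a second order, a second payment, a second user.
  const attempts = IDEMPOTENT.has(method) ? maxAttempts : 1;
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await send(method, url);
    } catch (err) {
      lastError = err; // transient failure: loop retries if allowed
    }
  }
  throw lastError;
}
```

The common production workaround for POST is an idempotency key: the client sends a unique token with the request so the server can detect and deduplicate retries.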
TCP vs. UDP
TCP guarantees delivery and ordering, at the cost of overhead (handshake, acknowledgments, retransmission). UDP skips all that — it fires datagrams and forgets them. For video calls and games, a dropped packet is better than a delayed one. Use UDP when latency matters more than reliability.
Remote Procedure Call (RPC)
RPC makes a network call look like a local function call. The messaging details are hidden behind generated client stubs. Popular frameworks: gRPC (Google, uses Protocol Buffers), Thrift (Meta), Twirp.
RPC is best suited for internal service-to-service calls where you control both ends. The strong typing (from Protobuf schemas) catches API mismatches at compile time. Downside: tighter coupling than REST — consumers need to use the generated client.
REST
REST is an architectural style for designing APIs over HTTP. Key constraints: stateless, cacheable, uniform interface (resources identified by URLs, changed by verbs).
// REST API design example
// Resource-based URLs, HTTP verbs signal the action
GET /users/1234 // Fetch a specific user
POST /users // Create a new user
PUT /users/1234 // Replace the entire user record
PATCH /users/1234 // Update specific fields
DELETE /users/1234 // Delete the user
// Nested resources for relationships
GET /users/1234/orders // Orders belonging to user 1234
POST /users/1234/orders // Create an order for user 1234
REST vs. RPC trade-offs:
| Dimension | REST | RPC (gRPC) |
|---|---|---|
| Interface | HTTP verbs + resource URLs | Named procedures |
| Payload | JSON (verbose) | Binary Protobuf (compact) |
| Typing | Loose (JSON schema or OpenAPI) | Strong (generated stubs) |
| Browser support | Native | Requires proxy (grpc-web) |
| Best for | Public APIs, external clients | Internal service mesh |
Use REST for external APIs and browser clients. Use gRPC internally when you need performance and type safety across service boundaries.
Security Fundamentals
Security is a cross-cutting concern. These are the non-negotiables for any system design discussion:
- Encrypt in transit and at rest. TLS everywhere, even for internal service calls in a service mesh (mTLS). Encrypt database volumes and backup snapshots.
- Parameterized queries. Never concatenate user input into SQL strings. Use prepared statements. Full stop.
- Output encoding. Before rendering user-supplied content in HTML, encode it. This prevents XSS.
- Principle of least privilege. Services should only have the database permissions they need. An API server that only reads should only have SELECT grants, not INSERT or DROP.
- Rate limiting. Protect login endpoints, API keys, and any expensive operation from brute force and abuse.
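To see why parameterized queries block injection, note that the SQL text and the user input travel to the database separately, so input is never parsed as SQL. The sketch below uses a fake client whose `query(text, params)` shape mirrors common drivers like node-postgres; the client itself is a stand-in, not a real library:

```typescript
// Sketch: with parameterized queries, the statement text stays constant
// and user input rides along as data. FakeDb just records what a real
// driver would send over the wire.
interface QueryLog {
  text: string;
  params: unknown[];
}

class FakeDb {
  log: QueryLog[] = [];
  query(text: string, params: unknown[]): void {
    this.log.push({ text, params }); // a real driver transmits these separately
  }
}

const db = new FakeDb();
const userInput = "1; DROP TABLE users; --"; // classic injection payload

// BAD (shown as a comment only): concatenation turns input into SQL:
//   "SELECT * FROM users WHERE id = " + userInput
//   → SELECT * FROM users WHERE id = 1; DROP TABLE users; --

// GOOD: the statement is fixed; the payload stays inert as a parameter.
db.query("SELECT * FROM users WHERE id = $1", [userInput]);
```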
OWASP Top 10 — know these
SQL injection, broken authentication, sensitive data exposure, IDOR, security misconfiguration, XSS, and SSRF are the most commonly exploited vulnerabilities. If an interviewer asks about security, cover at least injection prevention and the principle of least privilege.
In a system design interview, you don't need to go deep on security unless it's the focus. Mention it once ("I'd secure API communication with TLS and authenticate requests using JWTs validated at the API gateway"), then move on unless the interviewer pushes.
Latency Numbers Every Engineer Should Know
These numbers don't change much year to year. They tell you what's fast, what's slow, and where caches are worth adding.
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | On-chip, almost instant |
| L2 cache reference | 7 ns | 14× L1 |
| Main memory (RAM) | 100 ns | 200× L1 |
| Read 4KB from SSD | 150 µs | ~1 GB/s SSD |
| Read 1 MB from RAM | 250 µs | Sequential read |
| Round-trip in same datacenter | 500 µs | 0.5ms |
| Read 1 MB from SSD | 1 ms | |
| HDD disk seek | 10 ms | Avoid at all costs |
| Read 1 MB from HDD | 30 ms | |
| US east → west coast | 40 ms | Cross-continental |
| US → Europe | 80 ms | Transatlantic |
| US → Asia | 150 ms | Transpacific |
What these numbers tell you:
- RAM is 200× faster than disk. This is why every caching strategy exists.
- An SSD random read is 30× faster than a hard disk seek. Use SSDs for databases.
- A round-trip inside your datacenter (500µs) is 300× faster than a transatlantic call (150ms). Keep your cache close to your application.
- If a cache hit takes 0.25ms and a DB query takes 10ms, a 90% cache hit rate brings the average read latency to about 1.2ms, roughly 8× faster than hitting the DB every time.
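The cache hit rate arithmetic in the last bullet is a weighted average worth keeping at hand:

```typescript
// Effective average read latency for a given cache hit rate:
// hitRate * hitLatency + (1 - hitRate) * missLatency.
function effectiveLatencyMs(hitRate: number, hitMs: number, missMs: number): number {
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

const avg = effectiveLatencyMs(0.9, 0.25, 10); // 0.9*0.25 + 0.1*10 = 1.225 ms
const speedup = 10 / avg; // ≈ 8.2x vs. always hitting the DB
```

Notice how the misses dominate: even at a 90% hit rate, the 10% of requests that reach the database contribute most of the average. Pushing the hit rate from 90% to 99% is often worth more than making the cache itself faster.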
Quick Recap
- Lead the interview: Clarify scope, estimate scale, design simple, identify bottlenecks, then scale. Never jump to a globally distributed architecture before the requirements justify it.
- Know your core trade-offs: Performance vs. scalability, latency vs. throughput, availability vs. consistency. These dictate every architectural decision downstream.
- CAP means choosing between AP and CP during partitions. Networks fail. AP systems stay up with stale data; CP systems refuse requests to stay consistent.
- Scale databases incrementally: read replicas → federation → sharding. Add caching before you add replicas. Add replicas before you add shards.
- Caching is your highest-leverage tool. Cache-aside is the default; understand all four update strategies (cache-aside, write-through, write-behind, refresh-ahead) and their failure modes.
- Queues decouple producers from consumers and absorb traffic spikes. Always answer the back pressure question: what happens when the queue fills up?
- Use REST for external APIs, RPC for internal service calls. Use TCP for reliable delivery, UDP for latency-sensitive real-time data.
- Availability compounds. Components in sequence multiply downtime risk. Components in parallel reduce it. Do the math when you make SLA commitments.
Related Concepts
- Scalability — Deep dive into horizontal vs. vertical scaling, stateless services, and the scaling ladder.
- Caching — Full treatment of cache topologies, Redis internals, and eviction policies.
- Load Balancing — Layer 4 vs. Layer 7, routing algorithms, and health check strategies.
- Message Queues — Kafka vs. SQS vs. RabbitMQ, consumer group design, and partition strategies.
- Databases — SQL vs. NoSQL decision framework, indexing, and query optimization.