System design crash course
Everything you need to design large-scale systems — trade-offs, CAP, databases, caching, queues, and communication protocols — in one place.
TL;DR
- System design interviews are open-ended conversations. You lead. Clarify scope, then iterate from simple to scaled.
- Every architectural decision is a trade-off. There are no objectively correct answers, only contextually better ones.
- The foundational tensions: performance vs. scalability, latency vs. throughput, availability vs. consistency.
- CAP theorem says that when a partition happens, a distributed system must choose between consistency and availability. Partition tolerance is not optional.
- The bottleneck always moves. Fix the DB first, then add caches, then replicas, then shards. Don't over-engineer day one.
How to Approach a System Design Interview
Most candidates lose system design interviews not because they don't know the concepts, but because they jump to solutions before understanding the problem. The interviewer is watching how you think, not just what you know.
Here's the loop that works:
Step 1: Clarify the problem. Ask about use cases, scale, and constraints. "Do we need real-time reads or is eventual consistency acceptable?" Don't spend more than 5 minutes here.
Step 2: Estimate scale. Back-of-the-envelope numbers anchor the design. How many requests per second? How much storage over five years? These numbers determine whether you need sharding, replication, or a queue.
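A back-of-the-envelope estimate can be sketched in a few lines. All the input numbers below (users, requests per user, bytes per day) are illustrative assumptions for a hypothetical service, not real requirements:

```typescript
// Back-of-the-envelope sizing for a hypothetical photo-sharing service.
// Every input number here is an illustrative assumption.
const dailyActiveUsers = 10_000_000;
const requestsPerUserPerDay = 20;
const secondsPerDay = 86_400;

// Average requests per second, plus a peak factor for traffic spikes.
const avgRps = (dailyActiveUsers * requestsPerUserPerDay) / secondsPerDay;
const peakRps = avgRps * 3; // assume peak traffic is ~3x the average

// Storage over five years, assuming 1 MB of new data per user per day.
const bytesPerUserPerDay = 1_000_000;
const fiveYearBytes = dailyActiveUsers * bytesPerUserPerDay * 365 * 5;

console.log(`~${Math.round(avgRps)} avg RPS, ~${Math.round(peakRps)} peak RPS`);
console.log(`~${(fiveYearBytes / 1e15).toFixed(1)} PB over five years`);
```

Numbers like these (a few thousand RPS, double-digit petabytes) are what tell you whether a single database is plausible or whether sharding is on the table from day one.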
Step 3: Design the core components. Start simple. One server, one database. Then identify the bottlenecks. Don't pre-optimize.
Step 4: Scale iteratively. Add load balancers when you have multiple app servers. Add caches when reads dominate. Add replicas when your DB is the bottleneck. Add shards when a single DB can't hold the data.
Clarify → Estimate → Design (simple) → Identify Bottleneck → Scale → Repeat
Say this in your interview
"I'll start with a simple design and then identify bottlenecks to scale from there." This one sentence tells the interviewer you think iteratively and won't over-engineer from the start.
The most common failure mode: spending 20 minutes designing a globally distributed system when the requirements say "1 million users." Match your design to the scale you're given.
The Three Core Trade-offs
Before picking any component, you need these three trade-offs in your head. They come up in every system design question, whether you're asked about them explicitly or not.
Performance vs. Scalability
A system has a performance problem when it's slow for a single user. It has a scalability problem when it's fast for one user but degrades under load. These are different failure modes requiring different fixes.
Optimizing a query fixes performance. Adding a cache layer fixes scalability. Know which problem you actually have before reaching for a solution.
Latency vs. Throughput
Latency is the time it takes for one operation to complete. Throughput is how many operations you can complete per unit of time. You generally want to maximize throughput while keeping latency under an acceptable ceiling.
The tension: batching increases throughput but adds latency per operation. Real-time systems need low latency. Data pipelines need high throughput. Pick the right trade for your workload.
Availability vs. Consistency
This is the tension that everything downstream flows from. In a distributed system, when the network partitions (and it will), you have to choose: do you return potentially stale data (availability), or do you return an error (consistency)?
Neither choice is always wrong. Financial systems need consistency. Social feeds can tolerate staleness.
CAP Theorem
In any distributed system, network partitions are unavoidable. So you don't actually choose between all three; in practice you choose between CP and AP when a partition occurs.
CP (Consistency + Partition Tolerance): The system stays consistent by refusing to serve requests that might return stale data. Banks, booking systems, anything where double-booking is catastrophic. The downside is that during a partition, you might get a timeout instead of an answer.
AP (Availability + Partition Tolerance): The system always responds, but the data might not be the latest version. DNS, shopping carts, social feeds. The upside is uninterrupted service. The downside is that two nodes might disagree for a window of time.
The common CAP misconception
CAP doesn't say you permanently sacrifice one property. It says that during a partition, you have to pick which one degrades. Most modern "AP" systems do eventually become consistent once the partition heals.
Networks aren't reliable, so you'll always need partition tolerance. The real decision is what you sacrifice during the window when the network breaks.
Consistency Patterns
Once you've decided on CP vs. AP, you need to pick a consistency model for how data replicates across nodes.
| Model | Behavior | Good for |
|---|---|---|
| Weak consistency | After a write, reads may or may not see it. Best-effort. | VoIP, video streaming, multiplayer games |
| Eventual consistency | After a write, reads will eventually converge to it (typically milliseconds). | DNS, email, social feeds, shopping carts |
| Strong consistency | After a write, all subsequent reads reflect it immediately. | Banking, inventory, booking systems |
The trade-off is clear. Strong consistency is slower (every write waits for replication confirmation). Eventual consistency is faster but requires your application to tolerate briefly stale reads.
Availability Patterns
High availability is achieved through redundancy. Two patterns dominate:
Fail-over
Active-passive: One server handles all traffic. A second server sits idle, receiving heartbeats. When the heartbeat stops, the passive server claims the active's IP and takes over. Downtime depends on whether the passive is "warm" (already running) or "cold" (needs to boot).
Active-active: Both servers handle traffic simultaneously. The load balancer sends traffic to both. If one fails, the other absorbs its share. No cold-start delay. This is what most well-designed systems use.
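The active-passive promotion logic can be sketched as a small heartbeat monitor. The timeout value and the injected clock are assumptions for illustration; a real implementation would also claim the shared IP and guard against split-brain:

```typescript
// Minimal sketch of active-passive failover driven by heartbeats.
// Timing values are illustrative; real systems also need split-brain protection.
type Role = "active" | "passive";

class FailoverMonitor {
  private lastHeartbeat: number;
  role: Role = "passive";

  constructor(
    private readonly timeoutMs: number,
    private readonly now: () => number = Date.now, // injectable clock for testing
  ) {
    this.lastHeartbeat = this.now();
  }

  // Called each time the passive node hears from the active node.
  recordHeartbeat(): void {
    this.lastHeartbeat = this.now();
  }

  // Called on a timer: promote ourselves if the active node has gone silent.
  check(): Role {
    if (this.role === "passive" && this.now() - this.lastHeartbeat > this.timeoutMs) {
      this.role = "active"; // in a real system: claim the shared IP here
    }
    return this.role;
  }
}
```

The "warm" vs. "cold" distinction from above lives outside this sketch: a warm passive can start serving the moment `check()` flips to active, while a cold one still has to boot.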
Availability in Numbers
Availability is measured in "nines." Two nines (99%) sounds fine until you realize it means 87 hours of downtime per year.
| SLA | Downtime per year | Downtime per month |
|---|---|---|
| 99% (two 9s) | 87.6 hours | 7.3 hours |
| 99.9% (three 9s) | 8.7 hours | 43.8 minutes |
| 99.99% (four 9s) | 52.6 minutes | 4.4 minutes |
| 99.999% (five 9s) | 5.3 minutes | 26.3 seconds |
Components in sequence multiply failure risk. Two components each at 99.9% availability, placed in sequence, give you 99.8% combined. Components in parallel improve availability. Those same two components in parallel give you 99.9999%.
This is why you put databases in replication groups and load balancers in pairs.
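The sequence and parallel rules above are just two small formulas, sketched here for clarity:

```typescript
// Combined availability of components, per the rules above:
// in sequence, availabilities multiply; in parallel, the failure
// probabilities multiply and you subtract the product from 1.
function inSequence(...availabilities: number[]): number {
  return availabilities.reduce((acc, a) => acc * a, 1);
}

function inParallel(...availabilities: number[]): number {
  const combinedFailure = availabilities.reduce((acc, a) => acc * (1 - a), 1);
  return 1 - combinedFailure;
}

console.log(inSequence(0.999, 0.999)); // ≈ 0.998 (99.8%: worse than either alone)
console.log(inParallel(0.999, 0.999)); // ≈ 0.999999 (six 9s: better than either alone)
```

Run the numbers before committing to an SLA: a chain of five 99.9% services in sequence is already down to about 99.5%.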
Domain Name System (DNS)
DNS translates human-readable domain names (like api.example.com) into IP addresses. It's hierarchical: authoritative servers at the top, recursive resolvers in the middle, and TTL-controlled caches at the browser and OS level.
Key DNS routing strategies:
- Weighted round-robin: Split traffic across multiple IPs in configurable proportions. Useful for A/B testing or gradual rollouts.
- Latency-based routing: Send each user to the server with the lowest round-trip time. AWS Route 53 does this automatically.
- Geolocation routing: Route European users to EU servers, US users to US servers. Reduces latency and helps with data residency compliance.
The main downside: DNS changes propagate slowly due to TTLs. If you need instant failover, keep TTLs short (60–300 seconds) before any planned changes.
Content Delivery Network (CDN)
A CDN is a globally distributed network of edge servers that cache and serve content from locations physically close to users. Without a CDN, every request for a static asset travels back to your origin server — potentially crossing an ocean.
Two delivery models:
Push CDN: You upload content to the CDN proactively. Content is immediately available everywhere. Best for sites with infrequent updates and moderate traffic (push once, serve forever). Downside: storage costs scale with your content library.
Pull CDN: Content stays on your origin. The CDN fetches it on first request and caches it until TTL expires. Best for high-traffic sites where popular content gets requested repeatedly. Downside: the first user for each asset waits for the origin round-trip.
Rule of thumb for CDN choice
Low traffic or infrequent updates: use Push. High traffic with frequently accessed assets: use Pull — the cache hit rate will be high and your origin won't feel the load.
CDNs also protect your origin from traffic spikes and DDoS by absorbing requests at the edge before they reach your infrastructure.
Load Balancer
A load balancer sits between your clients and your backend servers, distributing incoming traffic so no single server bears the full load. Without one, you can't scale horizontally.
Routing algorithms:
| Algorithm | How it works | Best for |
|---|---|---|
| Round-robin | Rotate through servers in order | Identical servers, uniform request cost |
| Weighted round-robin | Rotate, but heavier servers get more turns | Heterogeneous server fleet |
| Least connections | Route to the server with fewest active connections | Long-lived connections (WebSockets) |
| Session/sticky | Route the same user to the same server | Apps with server-side session state (avoid this) |
| Layer 4 | Route by IP/port only, no content inspection | Maximum throughput, minimal overhead |
| Layer 7 | Route by URL path, headers, content | Advanced routing (e.g., /api/* to one service) |
Layer 4 vs Layer 7: L4 load balancers operate at the transport layer — they look at IPs and ports, make routing decisions, and forward packets without reading the request body. They're fast but "dumb." L7 load balancers read the HTTP request. They can route /videos to video servers and /checkout to payment servers. More expensive computationally, but far more flexible.
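To make the weighted round-robin row concrete, here is a naive sketch. The server names and weights are made up, and the expansion approach is the simplest possible one; production balancers like NGINX use a smoother interleaving (smooth weighted round-robin) so a heavy server's turns are spread out:

```typescript
// Naive weighted round-robin sketch: each backend appears in the
// rotation in proportion to its weight. Names and weights are made up.
interface Backend {
  host: string;
  weight: number; // relative capacity
}

function buildRotation(backends: Backend[]): string[] {
  // Expand each backend `weight` times into a flat rotation list.
  return backends.flatMap((b) => Array(b.weight).fill(b.host));
}

const rotation = buildRotation([
  { host: "big-server", weight: 3 },   // gets 3 of every 4 requests
  { host: "small-server", weight: 1 }, // gets 1 of every 4 requests
]);

// Pick the next backend by cycling through the rotation.
function pick(requestCounter: number): string {
  return rotation[requestCounter % rotation.length];
}
```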
The load balancer is a SPOF
A single load balancer is itself a single point of failure. Production systems run two LBs in active-passive or active-active mode. AWS ALB and GCP Cloud Load Balancing handle this for you automatically.
Load balancers also handle SSL termination (so your backend servers don't need to manage certs) and health checks (so traffic never goes to a server that's failing).
Reverse Proxy
A reverse proxy is a server that sits in front of your app servers and forwards client requests to them. From the client's perspective, it looks like one server. From your infrastructure's perspective, it's a flexible traffic router.
The distinction from a load balancer: a load balancer distributes traffic across multiple identical servers, while a reverse proxy often sits in front of a single backend or a heterogeneous set of services.
Reverse proxies give you:
- Security: Hide internal server details. Clients never talk directly to your app. Blacklist IPs at one point.
- SSL termination: Handle HTTPS at the proxy, talk HTTP internally.
- Compression: Gzip responses before sending them to clients.
- Caching: Cache responses for frequently requested resources.
- Static content: Serve HTML, CSS, and JS directly without hitting app servers.
NGINX and Envoy are the most common. In Kubernetes environments, an ingress controller plays this role.
Databases
Your choice of database sets the ceiling for everything else. Get it wrong and no amount of caching will save you.
Relational Databases (RDBMS)
Relational databases organize data into tables with rows and columns. They enforce schemas, provide ACID transactions, and support complex joins. Postgres and MySQL dominate. When you need strict consistency and your data has relationships you need to query across, start here.
ACID guarantees:
- Atomicity: All operations in a transaction succeed or all roll back.
- Consistency: Transactions bring the DB from one valid state to another.
- Isolation: Concurrent transactions don't interfere.
- Durability: Committed writes survive crashes.
Scaling Patterns for Relational DBs
Master-slave replication: One primary handles writes. One or more replicas receive async copies and handle reads. Great for read-heavy workloads (most apps are 80%+ reads). The downside: if the primary fails, a replica must be promoted, and you might lose recent writes.
Master-master replication: Multiple nodes accept writes. Conflict resolution is complex. Useful when you need write availability in multiple regions. Most teams avoid this unless they specifically need it.
Federation (functional partitioning): Split your monolithic database into multiple databases by domain. The users table goes to the users database; the orders table goes to the orders database. Less write contention per database, and each one can fit more data in memory for faster access. Downside: cross-database joins require application-level code.
Sharding: Split a single large table across multiple database nodes by a shard key (e.g., user ID modulo number of shards). Each shard holds a subset of rows. This eliminates the write bottleneck of a single primary. Downside: choosing the wrong shard key causes hotspots; resharding when you add nodes is painful.
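Shard routing itself is simple, which is part of why the shard-key choice matters so much. A minimal sketch, with an illustrative hash function and shard count (real systems often use stronger hashes or consistent hashing):

```typescript
// Sketch: route a row to a shard by hashing its shard key.
// FNV-1a-style hash for illustration; production systems use stronger hashes.
function hashString(s: string): number {
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0; // force unsigned 32-bit
}

function shardFor(userId: string, shardCount: number): number {
  return hashString(userId) % shardCount;
}

// Every read and write for this user goes to the same shard.
const shard = shardFor("user-42", 4);
```

Note what happens when you grow from 4 shards to 5: `hash % shardCount` changes for nearly every key, forcing a mass data migration. That resharding pain is exactly why consistent hashing exists.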
Denormalization: Pre-compute joins by storing duplicate data across tables. Reads become fast without complex queries. Writes become more complex because you must keep duplicates in sync. Use when your read:write ratio is extremely high (100:1+).
SQL tuning essentials: Add indexes on query predicates. Use EXPLAIN ANALYZE to see what the planner actually does. Avoid SELECT * — fetch only the columns you need. Use CHAR for fixed-length fields and VARCHAR where lengths vary. Avoid large BLOBs in relational tables — store the reference, not the object.
NoSQL Databases
NoSQL covers four fundamentally different data models. They trade SQL's expressiveness for different scaling profiles.
Key-value store: The simplest model. A distributed hash table. O(1) reads and writes. Redis is the go-to. Everything from caching to session storage to rate limit counters. The downside: limited query expressiveness — you can only fetch by key.
Document store: Stores JSON (or JSON-like) documents. Each document is self-describing and can have different fields. MongoDB is the canonical example. Great when your data schema evolves frequently, or when you're working with hierarchical data. Downside: joins across documents require application code.
Wide-column store: Stores data as rows but with flexible columns per row. Think of it as a Map of Maps: ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>>. Cassandra and HBase are the standard choices. Optimized for massive write throughput and time-series data. The column family and row key design is critical — get it wrong and you'll get hotspot nodes.
Graph database: Stores data as nodes (entities) and edges (relationships). Optimized for graph traversals — "find all friends of friends who bought this product." Neo4j is the most common. Powerful when your data is fundamentally relational in a graph-theoretic sense, but most systems don't need it.
SQL vs. NoSQL Decision
Choose SQL when: data is structured, you need transactions, you have complex join requirements, and your team knows it well. Choose NoSQL when: schema is dynamic or unpredictable, you need to store petabytes, your workload is write-heavy at very high throughput, or you're building a cache layer.
Application Layer and Microservices
The application layer is where your business logic lives. Separating it from the web layer (the ingress point that handles HTTP) lets you scale them independently.
Service discovery is the mechanism by which services find each other. Two patterns:
- Client-side discovery: Each service queries a registry (like Consul) directly and load-balances calls itself.
- Server-side discovery: A router/proxy queries the registry and handles load balancing transparently (e.g., AWS ALB + ECS).
Breaking a monolith into microservices lets each team own a service and deploy independently. But it also introduces distributed systems complexity: network calls fail, services have independent failure modes, and data consistency across service boundaries is hard. Don't split unless you're feeling the pain of a single codebase.
Caching
Caching is the single highest-leverage tool in a system designer's toolkit. A cache hit costs microseconds. A database query costs milliseconds. That's a 1000x difference.
Where to Cache
Multiple layers can coexist. Browser caches reduce repeat requests. CDN caches reduce origin load. Application-level Redis caches are the workhorse for backend caching.
Caching at the query level: Hash a DB query string and store the result. Simple to implement. Fragile — if any cell in the result changes, you need to invalidate all cached queries that might have included that data.
Caching at the object level: Cache entire domain objects (e.g., a User object). Easier to reason about. Invalidation is simpler — when the user changes, evict the user cache key.
Cache Update Strategies
This is where most interviews go deep. Know all four patterns.
| Strategy | How it works | Pro | Con |
|---|---|---|---|
| Cache-aside | App checks cache first, loads from DB on miss, then writes to cache | Only caches what's actually requested | Cache miss = 3 trips (check cache, read DB, write cache) |
| Write-through | App writes to cache; cache synchronously writes to DB | Cache always fresh | Every write is slower; cache fills with cold data |
| Write-behind | App writes to cache; cache writes to DB asynchronously | Fastest writes | Data loss risk if cache crashes before async flush |
| Refresh-ahead | Cache predicts what you'll need and pre-loads before TTL expires | Lowest read latency | Prediction errors waste cache capacity |
// Cache-aside pattern example
async function getUser(userId: string) {
  // 1. Check cache
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);
  // 2. Cache miss — fetch from DB (rows[0] with node-postgres-style clients)
  const { rows } = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  const user = rows[0];
  if (!user) return null; // don't cache the miss (or add negative caching deliberately)
  // 3. Populate cache with a 1-hour TTL
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));
  return user;
}
Cache invalidation is the hard part
Phil Karlton's famous quote: "There are only two hard things in computer science: cache invalidation and naming things." When your source-of-truth changes, how does the cache learn about it? TTL-based expiry is the simplest. Event-driven invalidation is precise but complex to implement correctly.
Cache-aside is the default starting point for most systems. Use write-through when you can't tolerate stale cache entries. Use write-behind only when write throughput is the primary bottleneck and you've explicitly accepted the risk of data loss.
Asynchronism
Synchronous calls make the user wait while your system does work. Asynchronism offloads work so the user gets an immediate acknowledgment and the heavy processing happens in the background.
Message Queues
Message queues decouple producers from consumers. A producer publishes a message. The queue holds it. A consumer reads it when ready. If the consumer is slow, messages accumulate rather than cascading failures back to the producer.
Common choices: Redis (simple, but messages can be lost on crash), RabbitMQ (AMQP protocol, reliable, requires self-management), Amazon SQS (managed, at-least-once delivery, slight added latency), Apache Kafka (high throughput, ordered within partitions, replay-capable).
Task Queues
Task queues are a type of message queue focused on job execution. A task represents a unit of work (resize an image, send an email, recalculate a recommendation). Workers pick them up and execute them. Celery (Python) and BullMQ (Node.js) are popular.
Back Pressure
Back pressure is the signal a consumer sends to a producer when it can't keep up. Without back pressure, a fast producer overwhelms a slow consumer and the queue grows until memory runs out.
Three ways to apply back pressure:
- Reject: When the queue is full, the producer gets an error and retries later.
- Block: The producer is told to wait (useful in streaming pipelines).
- Drop: The system discards new messages (only acceptable for non-critical data like metrics sampling).
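The "Reject" strategy above can be sketched as a bounded in-memory queue that refuses new items when full, pushing the retry decision back onto the producer:

```typescript
// Sketch of back pressure via rejection: a bounded FIFO queue that
// returns false to producers instead of growing without limit.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly capacity: number) {}

  // Returns false when full: the producer is expected to retry with
  // backoff, shed the work, or route it to a lower-priority path.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

Real brokers expose the same idea differently: RabbitMQ has queue length limits with overflow behavior, and Kafka applies back pressure through producer buffer limits and blocking sends.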
In your interview
When you add a queue to your design, the interviewer will ask "what happens if the queue fills up?" That's the back pressure question. Have an answer ready: bounded queue with rejection, consumer scaling trigger, or dropping low-priority messages.
Communication Protocols
How services talk to each other is as important as what they say. The choice of protocol affects performance, debuggability, and coupling.
HTTP
HTTP is the application-level protocol underlying the entire web. A request has a verb and a resource; a response has a status code and a body. Stateless by design (each request carries all the context it needs).
| Verb | Action | Idempotent? | Safe? |
|---|---|---|---|
| GET | Fetch a resource | Yes | Yes |
| POST | Create a resource or trigger an action | No | No |
| PUT | Replace a resource entirely | Yes | No |
| PATCH | Partially update a resource | No | No |
| DELETE | Remove a resource | Yes | No |
Idempotent means calling it multiple times produces the same result as calling it once. This matters for retry logic.
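The retry implication can be made concrete: a client may safely re-send idempotent requests after a transient failure, but retrying a POST risks duplicating the side effect. A sketch, where `send` is an injected stand-in for a real HTTP client:

```typescript
// Sketch: retry only idempotent HTTP verbs on transient failure.
// `send` is an injected assumption standing in for a real HTTP client;
// it resolves to a status code or throws on a network error.
const IDEMPOTENT = new Set(["GET", "PUT", "DELETE", "HEAD"]);

async function requestWithRetry(
  method: string,
  url: string,
  send: (method: string, url: string) => Promise<number>,
  maxAttempts = 3,
): Promise<number> {
  // Non-idempotent verbs get exactly one attempt: a retry could
  // create a second order, a second payment, a second user.
  const attempts = IDEMPOTENT.has(method) ? maxAttempts : 1;
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await send(method, url);
    } catch (err) {
      lastError = err; // transient failure: loop retries if allowed
    }
  }
  throw lastError;
}
```

The common production workaround for POST is an idempotency key: the client sends a unique token with the request so the server can detect and deduplicate retries.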
TCP vs. UDP
TCP guarantees delivery and ordering, at the cost of overhead (handshake, acknowledgments, retransmission). UDP skips all that — it fires datagrams and forgets them. For video calls and games, a dropped packet is better than a delayed one. Use UDP when latency matters more than reliability.
Remote Procedure Call (RPC)
RPC makes a network call look like a local function call. The messaging details are hidden behind generated client stubs. Popular frameworks: gRPC (Google, uses Protocol Buffers), Thrift (Meta), Twirp.
RPC is best suited for internal service-to-service calls where you control both ends. The strong typing (from Protobuf schemas) catches API mismatches at compile time. Downside: tighter coupling than REST — consumers need to use the generated client.
REST
REST is an architectural style for designing APIs over HTTP. Key constraints: stateless, cacheable, uniform interface (resources identified by URLs, changed by verbs).
// REST API design example
// Resource-based URLs, HTTP verbs signal the action
GET /users/1234 // Fetch a specific user
POST /users // Create a new user
PUT /users/1234 // Replace the entire user record
PATCH /users/1234 // Update specific fields
DELETE /users/1234 // Delete the user
// Nested resources for relationships
GET /users/1234/orders // Orders belonging to user 1234
POST /users/1234/orders // Create an order for user 1234
REST vs. RPC trade-offs:
| Dimension | REST | RPC (gRPC) |
|---|---|---|
| Interface | HTTP verbs + resource URLs | Named procedures |
| Payload | JSON (verbose) | Binary Protobuf (compact) |
| Typing | Loose (JSON schema or OpenAPI) | Strong (generated stubs) |
| Browser support | Native | Requires proxy (grpc-web) |
| Best for | Public APIs, external clients | Internal service mesh |
Use REST for external APIs and browser clients. Use gRPC internally when you need performance and type safety across service boundaries.
Security Fundamentals
Security is a cross-cutting concern. These are the non-negotiables for any system design discussion:
- Encrypt in transit and at rest. TLS everywhere, even for internal service calls in a service mesh (mTLS). Encrypt database volumes and backup snapshots.
- Parameterized queries. Never concatenate user input into SQL strings. Use prepared statements. Full stop.
- Output encoding. Before rendering user-supplied content in HTML, encode it. This prevents XSS.
- Principle of least privilege. Services should only have the database permissions they need. An API server that only reads should only have SELECT grants, not INSERT or DROP.
- Rate limiting. Protect login endpoints, API keys, and any expensive operation from brute force and abuse.
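To see why parameterized queries block injection, note that the SQL text and the user input travel to the database separately, so input is never parsed as SQL. The sketch below uses a fake client whose `query(text, params)` shape mirrors common drivers like node-postgres; the client itself is a stand-in, not a real library:

```typescript
// Sketch: with parameterized queries, the statement text stays constant
// and user input rides along as data. FakeDb just records what a real
// driver would send over the wire.
interface QueryLog {
  text: string;
  params: unknown[];
}

class FakeDb {
  log: QueryLog[] = [];
  query(text: string, params: unknown[]): void {
    this.log.push({ text, params }); // a real driver transmits these separately
  }
}

const db = new FakeDb();
const userInput = "1; DROP TABLE users; --"; // classic injection payload

// BAD (shown as a comment only): concatenation turns input into SQL:
//   "SELECT * FROM users WHERE id = " + userInput
//   → SELECT * FROM users WHERE id = 1; DROP TABLE users; --

// GOOD: the statement is fixed; the payload stays inert as a parameter.
db.query("SELECT * FROM users WHERE id = $1", [userInput]);
```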
OWASP Top 10 — know these
SQL injection, broken authentication, sensitive data exposure, IDOR, security misconfiguration, XSS, and SSRF are the most commonly exploited vulnerabilities. If an interviewer asks about security, cover at least injection prevention and the principle of least privilege.
In a system design interview, you don't need to go deep on security unless it's the focus. Mention it once ("I'd secure API communication with TLS and authenticate requests using JWTs validated at the API gateway"), then move on unless the interviewer pushes.
Latency Numbers Every Engineer Should Know
These numbers don't change much year to year. They tell you what's fast, what's slow, and where caches are worth adding.
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | On-chip, almost instant |
| L2 cache reference | 7 ns | 14× L1 |
| Main memory (RAM) | 100 ns | 200× L1 |
| Read 4KB from SSD | 150 µs | ~1 GB/s SSD |
| Read 1 MB from RAM | 250 µs | Sequential read |
| Round-trip in same datacenter | 500 µs | 0.5ms |
| Read 1 MB from SSD | 1 ms | |
| HDD disk seek | 10 ms | Avoid at all costs |
| Read 1 MB from HDD | 30 ms | |
| US east → west coast | 40 ms | Cross-continental |
| US → Europe | 80 ms | Transatlantic |
| US → Asia | 150 ms | Transpacific |
What these numbers tell you:
- RAM is 200× faster than disk. This is why every caching strategy exists.
- An SSD random read is 30× faster than a hard disk seek. Use SSDs for databases.
- A round-trip inside your datacenter (500µs) is 300× faster than a transatlantic call (150ms). Keep your cache close to your application.
- If a cache hit takes 0.25ms and a DB query takes 10ms, a 90% cache hit rate brings the average read latency to about 1.2ms, roughly 8× faster than hitting the DB every time.
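The cache hit rate arithmetic in the last bullet is a weighted average worth keeping at hand:

```typescript
// Effective average read latency for a given cache hit rate:
// hitRate * hitLatency + (1 - hitRate) * missLatency.
function effectiveLatencyMs(hitRate: number, hitMs: number, missMs: number): number {
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

const avg = effectiveLatencyMs(0.9, 0.25, 10); // 0.9*0.25 + 0.1*10 = 1.225 ms
const speedup = 10 / avg; // ≈ 8.2x vs. always hitting the DB
```

Notice how the misses dominate: even at a 90% hit rate, the 10% of requests that reach the database contribute most of the average. Pushing the hit rate from 90% to 99% is often worth more than making the cache itself faster.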
Quick Recap
- Lead the interview: Clarify scope, estimate scale, design simple, identify bottlenecks, then scale. Never jump to a globally distributed architecture before the requirements justify it.
- Know your core trade-offs: Performance vs. scalability, latency vs. throughput, availability vs. consistency. These dictate every architectural decision downstream.
- CAP means choosing between AP and CP during partitions. Networks fail. AP systems stay up with stale data; CP systems refuse requests to stay consistent.
- Scale databases incrementally: read replicas → federation → sharding. Add caching before you add replicas. Add replicas before you add shards.
- Caching is your highest-leverage tool. Cache-aside is the default; understand all four update strategies (cache-aside, write-through, write-behind, refresh-ahead) and their failure modes.
- Queues decouple producers from consumers and absorb traffic spikes. Always answer the back pressure question: what happens when the queue fills up?
- Use REST for external APIs, RPC for internal service calls. Use TCP for reliable delivery, UDP for latency-sensitive real-time data.
- Availability compounds. Components in sequence multiply downtime risk. Components in parallel reduce it. Do the math when you make SLA commitments.
Related Concepts
- Scalability — Deep dive into horizontal vs. vertical scaling, stateless services, and the scaling ladder.
- Caching — Full treatment of cache topologies, Redis internals, and eviction policies.
- Load Balancing — Layer 4 vs. Layer 7, routing algorithms, and health check strategies.
- Message Queues — Kafka vs. SQS vs. RabbitMQ, consumer group design, and partition strategies.
- Databases — SQL vs. NoSQL decision framework, indexing, and query optimization.