Load balancing
Learn how load balancers distribute traffic across servers, which algorithms to choose, and how to design a highly available app tier in any system design interview.
TL;DR
- A load balancer sits in front of your server pool and distributes incoming requests so no single instance bears all the traffic.
- It's what makes horizontal scaling work: adding servers is useless unless something routes traffic to them.
- Layer 4 (TCP/UDP) load balancers are faster; Layer 7 (HTTP) load balancers are smarter: they route by URL, header, or cookie, and terminate SSL.
- The algorithm matters: round-robin is the default, least-connections wins for long-lived sockets, and sticky sessions are a trap unless you understand their failure modes.
- The fundamental trade-off: the load balancer itself is now a single point of failure that must be made highly available.
The Problem It Solves
Your startup just got featured on a tech blog. At 8:47 a.m., 50,000 people click the article simultaneously. Your app server process (the only one) pegs at 100% CPU.
The request queue fills. New connections get rejected with 503 Service Unavailable. Your on-call phone rings.
You spin up a second server. It runs fine. But all 50,000 users are still hammering the first one.
The second server sits at 0% CPU with zero traffic, because nobody is routing requests to it. I've seen this exact scenario play out in interviews: candidates propose horizontal scaling but forget the routing layer entirely.
That's the problem. Having more servers solves nothing if traffic has no mechanism to reach them.
The scaling blindspot
Every "scale horizontally" textbook recommendation silently assumes a load balancer already exists. Without one, adding servers doesn't reduce load on your original server at all; it just means you have more servers doing nothing.
flowchart TD
subgraph Internet["Internet Layer"]
Users(["Users\n50K concurrent\nAll traffic to same IP"])
end
subgraph Broken["Broken State: No Load Balancer"]
Server1["App Server 1\nCPU: 100% · Queue: Full\n503 errors · ~8s latency"]
Server2["App Server 2\nCPU: 0% · Completely Idle\nReceives zero traffic"]
end
Users -->|"All 50K requests → same IP"| Server1
Server2 -.->|"Unreachable: no route to it"| Server2
The fix isn't more servers alone. The fix is a component that knows all your servers exist and can distribute traffic across all of them.
What Is It?
A load balancer is a reverse proxy that sits in front of a pool of servers and distributes incoming requests across them. It continuously monitors server health and routes traffic only to healthy instances.
Analogy: Think of an airport departure terminal with 20 check-in counters. Without a dispatcher, every passenger walks to counter 1. Counter 1 is overwhelmed; counters 2–20 are empty.
With a dispatcher at the entrance directing passengers ("Counter 5 has the shortest queue, go there"), each counter handles a proportional share and passengers clear in minutes. The dispatcher doesn't do any checking-in; their only job is directing traffic efficiently. I'll often use this analogy in interviews: it makes the separation of concerns immediately obvious.
flowchart TD
subgraph Internet["Internet Layer"]
Users(["Users\n50K concurrent requests\nSingle DNS entry → VIP"])
end
subgraph LBTier["Load Balancer Tier: Active/Passive HA"]
LB["Primary Load Balancer\nHealth checks · Algorithm routing\nSSL termination · Connection draining"]
LB_Standby["Standby Load Balancer\nPassive: promoted on primary failure\nShared VIP via VRRP / cloud HA"]
end
subgraph AppTier["Stateless App Tier: Auto-Scaled"]
AS1["App Server 1\nStateless · Any request handled\nCPU: ~33% under even load"]
AS2["App Server 2\nStateless · Any request handled\nCPU: ~33% under even load"]
AS3["App Server N\nStateless · Auto-added on scale event\nAuto-removed on scale-in"]
end
subgraph SessionStore["Session Store"]
Redis["Redis\nSessions · Rate limit counters\nShared by all app servers · < 1ms reads"]
end
Users -->|"HTTPS · DNS resolves to VIP"| LB
LB -.->|"Failover · VRRP heartbeat"| LB_Standby
LB -->|"Route · round-robin / least-conn"| AS1 & AS2 & AS3
AS1 & AS2 & AS3 -->|"Session reads / writes"| Redis
The load balancer gives every server in the pool a fair share of work and hides individual server failures from users entirely. A server going down doesn't degrade the service: the load balancer simply stops routing to it. Stateless app servers plus a load balancer: that's the foundation every scalable system starts with.
How It Works
Here's exactly what happens when a user's request hits a load balancer:
- DNS resolution: The client resolves `api.yoursite.com` to a single Virtual IP (VIP) address. The VIP is owned by the load balancer, not any backend server. This decoupling is what allows backend instances to be added, removed, or replaced without any DNS change.
- Connection established: At Layer 4, the LB terminates the TCP connection from the client and opens a new one to the chosen backend. At Layer 7, it also parses the HTTP request before making a routing decision.
- Algorithm selects a backend: The load balancer runs its assignment algorithm to pick one healthy server from the pool. (Algorithms are covered in the next section.)
- Health check gate: Before routing, and continuously during operation, the LB probes each backend. Only servers that pass health checks are eligible for traffic. A server that returns errors or fails to respond within the timeout is removed from rotation automatically.
- Request forwarded: The request is proxied to the selected backend. For L7 balancers, headers are injected here: `X-Forwarded-For: <client-ip>`, `X-Request-ID: <trace-id>`.
- Response returned: The backend responds to the LB; the LB returns the response to the original client. From the client's perspective, it's talking to one server. The load balancer is completely transparent.
- Connection tracking: For long-lived connections (WebSockets, gRPC streaming), the LB pins the entire session to the same backend until the connection closes.
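The header injection in the request-forwarding step can be sketched as a pure function. This is a hedged illustration: the append-if-present behaviour for X-Forwarded-For follows the common convention, and the function name is invented.

```typescript
// Sketch: the forwarding headers an L7 balancer injects before proxying.
// buildForwardHeaders is an illustrative name, not a real LB API.
function buildForwardHeaders(
  original: Record<string, string>,
  clientIp: string,
  requestId: string,
): Record<string, string> {
  const prior = original["X-Forwarded-For"]; // an upstream proxy may have set it already
  return {
    ...original,
    // Append so the backend sees the full proxy chain, client first
    "X-Forwarded-For": prior ? `${prior}, ${clientIp}` : clientIp,
    // Keep an existing trace id so one request correlates across hops
    "X-Request-ID": original["X-Request-ID"] ?? requestId,
  };
}
```

A backend reading X-Forwarded-For should only trust the entry appended by its own LB; earlier entries in the chain are client-controlled.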
sequenceDiagram
participant C as Client
participant LB as LB (Layer 7)
participant AS as App Server
C->>LB: TCP SYN → VIP:443
Note over LB: TLS handshake: decrypt here
C->>LB: HTTPS GET /api/data
Note over LB: Parse HTTP headers<br/>Select backend: least-conn<br/>Health gate passed → route to AS
LB->>AS: HTTP GET /api/data<br/>X-Forwarded-For: client-ip<br/>X-Request-ID: trace-id
activate AS
Note over AS: Handle request
AS-->>LB: HTTP 200 OK + payload
deactivate AS
Note over LB: Log: path · status · latency
LB-->>C: HTTPS 200 OK + payload
The LB is a transparent proxy: the client connects only to the VIP, TLS terminates at the LB, and the backend IP is never exposed to the client. The 1–3 ms overhead is the cost of parsing headers and selecting a backend; in my experience, it's invisible at any realistic HTTP traffic volume.
Here's what a minimal NGINX upstream config looks like in practice:
# nginx.conf: upstream pool with algorithm and health configuration
upstream api_servers {
least_conn; # Route to the instance with fewest active connections
server app-server-1.internal:3000 weight=1 max_fails=3 fail_timeout=30s;
server app-server-2.internal:3000 weight=1 max_fails=3 fail_timeout=30s;
server app-server-3.internal:3000 weight=2 max_fails=3 fail_timeout=30s; # 2× capacity
keepalive 32; # Keep up to 32 idle upstream connections warm
}
server {
listen 443 ssl;
server_name api.yoursite.com;
# SSL terminates here: backends get plain HTTP internally
ssl_certificate /etc/ssl/certs/api.crt;
ssl_certificate_key /etc/ssl/private/api.key;
location / {
proxy_pass http://api_servers;
proxy_set_header X-Forwarded-For $remote_addr; # Pass real client IP
proxy_set_header X-Request-ID $request_id; # For distributed tracing
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
}
Interview tip: name the health check endpoint
When you mention health checks in an interview, say what they check. "A GET /health probe every 5 seconds; if 3 consecutive probes fail or return non-2xx, the instance is removed from rotation." That's specific. "Health checks run" is vague and tells the interviewer nothing about your operational thinking.
Here's what the health check decision loop looks like on the LB side:
// Pseudocode: active health check loop per backend instance
// (BackendServer, addToRotation, removeFromRotation are the LB's pool primitives)
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function healthCheckLoop(server: BackendServer): Promise<void> {
while (true) {
try {
const res = await fetch(`http://${server.host}/health`, {
signal: AbortSignal.timeout(2000), // 2s timeout
});
if (res.ok) {
server.consecutiveFailures = 0;
if (server.status === "unhealthy") {
server.status = "healthy";
addToRotation(server); // Re-admit after recovery
}
} else {
server.consecutiveFailures++;
}
} catch {
server.consecutiveFailures++;
}
// Remove from pool after 3 consecutive failures
if (server.consecutiveFailures >= 3 && server.status === "healthy") {
server.status = "unhealthy";
removeFromRotation(server); // No traffic until next recovery check
}
await sleep(5000); // Re-probe every 5 seconds
}
}
The state machine that drives this logic:
flowchart TD
START(["Instance starts\nJoins LB pool"])
HEALTHY["HEALTHY\nReceives full traffic share\nProbed every 5s"]
DEGRADED["DEGRADED\nStill in rotation\nFail count: 1–2 of 3"]
UNHEALTHY["UNHEALTHY\nRemoved from pool\nProbed, no traffic"]
START -->|"First probe: 200 OK"| HEALTHY
HEALTHY -->|"Probe fails (1st or 2nd)"| DEGRADED
DEGRADED -->|"Next probe passes"| HEALTHY
DEGRADED -->|"3rd consecutive fail"| UNHEALTHY
UNHEALTHY -->|"Probe passes → re-admitted"| HEALTHY
The two-step removal (DEGRADED → UNHEALTHY) prevents a single flaky probe from pulling a healthy server. Recovery is immediate re-admission, which is why a slow-start policy (reduced weight for the first 60–90s) matters when a previously failing server comes back. Misconfigure the failure thresholds and you'll spend the next on-call shift chasing phantom outages.
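A slow-start policy can be sketched as a weight ramp applied after re-admission. This is an illustrative sketch: the 10% floor, the linear ramp, and the field names are assumptions, not any specific product's behaviour.

```typescript
// Sketch: slow-start as a weight ramp after re-admission.
interface Backend {
  baseWeight: number;
  readmittedAt: number | null; // epoch ms when the server rejoined; null if never removed
}

function effectiveWeight(server: Backend, nowMs: number, rampSeconds = 90): number {
  if (server.readmittedAt === null) return server.baseWeight; // steady-state server
  const elapsedSec = (nowMs - server.readmittedAt) / 1000;
  if (elapsedSec >= rampSeconds) return server.baseWeight; // ramp complete
  // Start at 10% of base weight and ramp linearly to 100% over rampSeconds
  const fraction = 0.1 + 0.9 * (elapsedSec / rampSeconds);
  return server.baseWeight * fraction;
}
```

A weighted algorithm then uses effectiveWeight instead of the static weight, so a cold server takes a trickle of traffic while its caches and JIT warm up.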
Key Components
| Component | Role |
|---|---|
| Virtual IP (VIP) | The single IP address that DNS resolves to. Owned by the LB tier, not any backend. Allows backends to change freely without client-side DNS impact. |
| Backend Pool | The set of healthy server instances eligible to receive traffic. The LB manages membership based on health check results. |
| Health Checker | Continuously probes each backend (TCP ping, HTTP GET, or custom script). Automatically promotes or demotes backends from the pool. |
| Routing Algorithm | Selects which pool member receives the next request. Responsible for even distribution, respecting server capacity, and adapting to load skew. |
| SSL Terminator | Decrypts TLS at the LB so backends communicate over plain HTTP internally. Centralises certificate renewal and reduces per-backend CPU overhead. |
| Connection Drainer | On scale-in or rolling deploy, allows in-flight connections to complete before the backend instance is removed. Prevents mid-request drops. |
| Session Store (Redis) | Not part of the LB itself, but the external session store that makes backends stateless β so the LB can route any request to any instance without session loss. |
| Standby / Secondary LB | A passive LB instance that takes over via failover (VRRP, keepalived, or cloud HA) if the primary fails. Eliminates the LB as a single point of failure. |
Types / Variations
Layer 4 vs. Layer 7
The single most important classification, and the one I see candidates sidestep most often. It comes up in every serious system design conversation. Default to Layer 7 for HTTP workloads; drop to Layer 4 only when the question explicitly involves raw TCP or sub-millisecond routing latency.
| Feature | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Protocol | TCP / UDP | HTTP / HTTPS / gRPC / WebSocket |
| Routing basis | IP address + port only | URL path, HTTP headers, cookies, body content |
| TLS termination | Not typically | Yes (standard) |
| Content-based routing | No | Yes: /api/* → API servers, /static/* → CDN origin |
| Performance | Faster: no HTTP parsing overhead | Slightly slower: must parse headers per request |
| Observability | Low: can't log HTTP status codes or paths | High: logs URL, status code, latency per request |
| AWS equivalent | NLB (Network Load Balancer) | ALB (Application Load Balancer) |
| Use cases | Raw TCP traffic, gaming servers, custom protocols | HTTP APIs, microservices, path-based routing |
Interview shortcut: default to L7
Unless the question explicitly involves raw TCP (multiplayer gaming, financial market data feeds, custom binary protocols), default to Layer 7. You get path-based routing, SSL termination, and per-request observability. The performance overhead is negligible for standard HTTP workloads.
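The content-based routing that makes L7 the default can be sketched as a small prefix router: the LB parses the HTTP path before picking a pool, which an L4 balancer cannot do. The route table and pool names below are invented for illustration.

```typescript
// Sketch: L7 path-based routing. Routes and pool names are illustrative.
const routes: Array<{ prefix: string; pool: string }> = [
  { prefix: "/api/", pool: "api_servers" },
  { prefix: "/static/", pool: "cdn_origin" },
];

function selectPool(path: string, fallback = "web_servers"): string {
  const match = routes.find((r) => path.startsWith(r.prefix));
  return match ? match.pool : fallback; // unmatched paths go to the default pool
}
```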
Load Balancing Algorithms
| Algorithm | How It Works | Best For | Pitfall |
|---|---|---|---|
| Round Robin | Distributes requests sequentially: 1→2→3→1→2→3 | Homogeneous servers, stateless short-lived requests | Treats all servers equally regardless of current load or capacity |
| Weighted Round Robin | Servers get a numeric weight; proportionally more requests go to higher-weight servers | Mixed-capacity pools (different instance sizes) | Weights must be maintained manually as the pool changes |
| Least Connections | Routes each new request to the server with the fewest active connections | Long-lived connections (WebSockets, gRPC streaming) | Requires tracking per-server connection state |
| Least Response Time | Routes to the server with the lowest current average response time | Heterogeneous workloads where some requests are expensive | Requires active sampling and adds coordination overhead |
| IP Hash | hash(client_ip) % N → same client always hits same server | Legacy session affinity for stateful backends | Breaks when server count changes; fails behind NAT |
| Random | Picks a server at random from the healthy pool | Simple stateless APIs; eliminates coordination overhead | Can cause hot servers by statistical chance |
| Resource-Based | Routes based on CPU/memory metrics reported by each backend agent | Heterogeneous or variable-capacity workloads | Requires a metrics agent on every backend |
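To make the first and third rows concrete, here's a minimal sketch of both selection strategies. The Server shape and class name are illustrative; a real balancer also filters out unhealthy servers and guards the counter against concurrent updates.

```typescript
// Sketch: round-robin and least-connections selection.
interface Server {
  name: string;
  activeConnections: number;
}

class RoundRobin {
  private next = 0;
  constructor(private pool: Server[]) {}
  pick(): Server {
    const server = this.pool[this.next];
    this.next = (this.next + 1) % this.pool.length; // 1 → 2 → 3 → back to 1
    return server;
  }
}

function leastConnections(pool: Server[]): Server {
  // Choose the server with the fewest in-flight connections right now
  return pool.reduce((best, s) => (s.activeConnections < best.activeConnections ? s : best));
}
```

Note what least-connections needs that round-robin doesn't: live per-server connection state, which is exactly the pitfall the table lists.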
IP Hash breaks when servers change
IP Hash provides session affinity, but it's fragile. Adding or removing a server changes N in hash(ip) % N, so all hash assignments shift. Existing users are suddenly routed to a different server and lose their in-memory session. This is why Redis-backed sessions are the correct solution, not IP Hash.
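A quick way to see the fragility is to count how many clients get remapped when the pool grows from 4 to 5 servers. The sketch below uses sequential integers as stand-ins for uniformly distributed hash(client_ip) values.

```typescript
// A client stays on the same server only when hash % 4 === hash % 5,
// which holds only when hash % 20 is 0-3. So growing the pool from
// 4 to 5 servers moves 4 out of every 5 clients.
function remappedFraction(hashes: number[], before: number, after: number): number {
  const moved = hashes.filter((h) => h % before !== h % after).length;
  return moved / hashes.length;
}

// Sequential integers stand in for uniformly distributed hash values
const hashes = Array.from({ length: 10_000 }, (_, i) => i);
console.log(remappedFraction(hashes, 4, 5)); // 0.8 -> 80% of clients change servers
```

Consistent hashing (as in Maglev, below) reduces that 80% to roughly 1/N of clients, which is why it's the standard fix when affinity genuinely matters.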
Hardware vs. Software vs. Cloud-Managed
| Type | Examples | Throughput | Ops Overhead | Cost |
|---|---|---|---|---|
| Hardware appliance | F5 BIG-IP, Citrix ADC | Very high: dedicated ASICs | High: physical box, firmware upgrades | Very high: $10K–$100K+ |
| Software | NGINX, HAProxy, Envoy | High: software-defined, runs on commodity hardware | Medium: you manage config, upgrades, HA | Low: open source |
| Cloud-managed | AWS ALB/NLB, GCP Cloud LB, Azure LB | Scales automatically | Very low: provider-managed HA, scaling | Pay per request + LCU |
For anything you're building from scratch today: cloud-managed is the default. You get automatic HA, multi-AZ redundancy, and auto-scaling for fractions of a cent per LCU. I'd only reach for HAProxy or NGINX when the requirement is explicitly on-prem or you need configuration that managed offerings don't support; otherwise you're managing infrastructure for no reason.
Trade-offs
| Pros | Cons |
|---|---|
| Eliminates app-tier SPOF: one instance down, others continue | The LB itself is now a potential SPOF (mitigated with active/standby HA or cloud-managed) |
| Enables horizontal scaling: add instances and traffic automatically distributes | One additional network hop: typically 1–3 ms latency overhead |
| Zero-downtime deployments: roll instances out one at a time with connection draining | SSL termination at LB means backend traffic is unencrypted internally (mitigate with end-to-end TLS or a service mesh) |
| Health checks remove failing instances within seconds, transparent to users | Stateful protocols (WebSockets, gRPC streaming) require connection pinning or L7 session tracking |
| SSL termination centralises certificate management for all backends | Misconfigured health checks cause false positives (healthy servers pulled) or false negatives (broken servers kept in rotation) |
| Single point for access logs, metrics, and trace ID injection | Misconfigured drain timeout causes mid-request drops during deploys |
The fundamental tension here is availability vs. complexity. A load balancer solves the single-point-of-failure problem for your app tier, but introduces itself as a new component that must be made highly available, monitored, and correctly configured.
The mistake I see most often: candidates draw the load balancer in their diagram without mentioning that it now needs HA too. Address both in the same breath; the interviewer will ask if you don't.
When to Use It / When to Avoid It
So when does this actually matter in an interview? Almost always β any system with more than one server needs one. Here's the practical guide.
Use a load balancer when:
- You have 2+ backend instances that should share traffic.
- You need fault tolerance β one instance failing must not take down the service.
- You need zero-downtime rolling deployments (draining connections from instances one at a time).
- You need SSL termination at a single point rather than managing certificates on every server.
- You need path-based routing to multiple services from a single entry point.
Avoid or simplify when:
- You're in a development environment β local port forwarding or a single process is sufficient.
- You have a monolith with no traffic redundancy requirement β a plain reverse proxy (NGINX) is often enough.
- You're routing internal service-to-service (east-west) traffic at high volume β consider a service mesh (Istio, Linkerd) rather than a centralised LB per route.
- You're prototyping β get the system working first, then add the LB tier before any production deploy.
Load balancer vs. API Gateway vs. reverse proxy
These three are often conflated. A reverse proxy (NGINX serving static files) just forwards traffic to one backend. A load balancer distributes across multiple backends with health checks. An API Gateway does routing plus auth, rate limiting, and protocol transformation. In practice, products like NGINX and Envoy can do all three; the question is which capabilities you're actually configuring.
Real-World Examples
Google: Maglev. Google built a custom software load balancer called Maglev that runs on commodity servers and handles over one million packets per second per machine. Maglev uses consistent hashing over a connection table of 65,537 buckets (a prime number chosen for uniform distribution) so the same connection always reaches the same backend even when backends are added or removed.
The design handles up to 640 Gbps per cluster and sits in front of every Google service globally. Maglev is a Layer 4 LB: the routing decision happens before any HTTP parsing.
Netflix: AWS ALB + Eureka + Ribbon. Netflix uses AWS ALBs as primary Layer 7 load balancers in front of their microservice clusters. They supplement this with client-side load balancing via Eureka (service registry) and Ribbon (in-process LB library): each service instance picks a backend directly, eliminating a full network hop for all internal traffic.
New instances use Ribbon's slow-start: reduced traffic weight for the first 90 seconds while their JVM JIT warms up, preventing cold instances from being overwhelmed. The slow-start pattern is worth stealing for any JVM or interpreted-language service you build.
Cloudflare: Geographic load balancing. Cloudflare's load balancer operates at the DNS level. When a client resolves your API hostname, Cloudflare returns the IP of the nearest healthy origin, informed by both geographic proximity and measured round-trip latency from multiple global vantage points.
Health checks run from multiple locations every 60 seconds. If an origin starts failing, DNS responses switch to the next healthy origin in well under a minute: geographic failover with zero backend changes required.
How This Shows Up in Interviews
Every system design interview with a backend tier needs a load balancer in the first diagram you draw, not as an afterthought after the interviewer asks "but what about availability?" Draw it immediately, name the layer, and state your algorithm choice in one sentence. The load balancer signals that you understand horizontal scaling isn't just "add more servers."
When to bring it up
Draw a load balancer in the first component you sketch for any system with multiple backend instances. Don't wait to be asked. Within 3 minutes of starting your design, a sentence like: "I'll put a Layer 7 load balancer here; it handles SSL termination, distributes traffic across app server instances, and removes unhealthy ones automatically" signals that you understand the fundamentals of availability.
Don't over-explain it
The load balancer is table stakes. Interviewers expect it to be there. What they want to hear is why, and what your specific choices are, not a description of round-robin. Open with the algorithm choice and HA setup in one sentence, then move to the more interesting design decisions.
Depth expected at senior/staff level:
- Name the algorithm and justify it for this specific workload. WebSockets → least-connections. Stateless REST → round-robin. Mixed capacity → weighted.
- Proactively address the LB as a potential SPOF. Mention active/standby HA or the cloud-managed equivalent.
- Know when to use L4 vs. L7 and what that changes about the design.
- Understand that stateless app servers are a prerequisite for the LB to route correctly, not a follow-on optimisation.
- Know what connection draining is and why it matters for zero-downtime deploys.
Common follow-up questions and strong answers:
| Interviewer asks | Strong answer |
|---|---|
| "What if the load balancer itself goes down?" | "Active/standby via VRRP: the standby holds the same VIP and promotes within 5 seconds. For cloud deployments, managed LBs (ALB, NLB) are inherently multi-AZ; the provider handles HA." |
| "How do you handle WebSockets?" | "Layer 7, least-connections algorithm. The LB pins the entire WebSocket session to one backend for its lifetime. Don't balance per request after the upgrade: every frame must reach the server that holds the connection state." |
| "Why not use IP Hash for session affinity?" | "IP Hash breaks when server count changes: all hash assignments shift. It also fails behind carrier NAT where many clients share one IP. The correct solution is stateless backends with session state in Redis." |
| "How do you do a zero-downtime deploy?" | "Connection draining: stop new requests to the instance being updated, wait for in-flight requests to complete (30s drain timeout), then replace the instance. Users never hit a server mid-deploy." |
| "L4 or L7: which would you pick?" | "L7 almost always: I get URL-based routing, SSL termination, and per-request observability. L4 only if the protocol is raw TCP or ultra-low-latency requirements make HTTP parsing overhead unacceptable." |
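The connection-draining answer above can be sketched as a small control loop. The Instance shape and the polling approach are illustrative; real LBs track in-flight counts natively.

```typescript
// Sketch: connection draining during a rolling deploy.
interface Instance {
  draining: boolean;
  inFlight: number;
}

async function drain(instance: Instance, timeoutMs = 30_000, pollMs = 250): Promise<boolean> {
  instance.draining = true; // the LB stops routing NEW requests here
  const deadline = Date.now() + timeoutMs;
  while (instance.inFlight > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs)); // let in-flight work finish
  }
  return instance.inFlight === 0; // true: safe to terminate; false: drain timed out
}
```

On timeout, deploy tooling has to choose between waiting longer and killing the remaining requests; that trade-off is what the 30s drain timeout in the table represents.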
Test Your Understanding
Quick Recap
- A load balancer distributes incoming traffic across a pool of healthy backend servers, eliminating the single-server bottleneck and making horizontal scaling possible. Without one, adding servers doesn't reduce load on the original.
- Layer 4 load balancers route by TCP/IP address and port: fast, protocol-agnostic. Layer 7 load balancers route by HTTP content (URL, headers, cookies) and terminate SSL: smarter, with per-request observability. Default to L7 for HTTP workloads.
- Algorithm choice drives real outcomes: round-robin for homogeneous stateless services; least-connections for long-lived sockets (WebSockets, gRPC streaming); weighted round-robin for mixed-capacity pools.
- Stateless app servers (sessions in Redis, not in process memory) are the prerequisite for the LB to route correctly. IP Hash (sticky sessions) breaks when server count changes and causes mass session loss on instance failure.
- Active health checks (GET /health) detect dead servers, not degraded ones. Passive outlier detection is needed to catch slow or error-prone instances before they saturate and cause visible failures.
- The load balancer itself must be made highly available: active/standby with a shared Virtual IP, or a cloud-managed LB that is inherently multi-AZ. A single LB node is just a new single point of failure.
- In interviews, name the algorithm and justify it, proactively address LB HA, explain stateless design as a prerequisite, and describe connection draining for zero-downtime deploys; these four together signal staff-level depth.
Related Concepts
- Scalability: Load balancing is the mechanism that makes horizontal scaling of the app tier possible. A stateless app tier behind a load balancer is the core pattern for handling 10× traffic spikes.
- API Gateway: An API Gateway includes a load balancer but adds routing, auth, rate limiting, and protocol translation. Know when a gateway adds value vs. when a bare load balancer is sufficient.
- Caching: Even a perfectly load-balanced app tier gets overwhelmed if every instance makes separate database reads. Caching in Redis keeps DB load flat as the number of instances grows.
- Rate Limiting: Rate limiters live at the load balancer or immediately behind it. Redis-backed distributed rate limiting prevents any single client from overwhelming the backend pool regardless of which instance handles their requests.
- Service Mesh: A service mesh handles load balancing for east-west (service-to-service) traffic, with retries, circuit breaking, and mTLS baked in. A front-end load balancer handles north-south (client-to-service) traffic. You typically need both.
- Envoy: A modern, microservices-focused proxy that can serve as an L7 load balancer and as the data plane of a service mesh.
In System Design Interviews
When discussing load balancing in interviews:
- Place load balancers between every critical tier (client→web server, web server→app, app→db)
- Mention redundant load balancers (active-passive) to avoid SPOF
- Discuss which algorithm and why
- Consider geographic load balancing (DNS-based) for global systems