πŸ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Concepts

Load balancing

Learn how load balancers distribute traffic across servers, which algorithms to choose, and how to design a highly available app tier in any system design interview.

40 min read Β· 2026-03-23 Β· easy Β· Tags: load-balancing, hld, concepts, scalability, availability

TL;DR

  • A load balancer sits in front of your server pool and distributes incoming requests so no single instance bears all the traffic.
  • It's what makes horizontal scaling work: adding servers is useless unless something routes traffic to them.
  • Layer 4 (TCP/UDP) load balancers are faster; Layer 7 (HTTP) load balancers are smarter β€” they route by URL, header, or cookie, and terminate SSL.
  • The algorithm matters: round-robin is the default, least-connections wins for long-lived sockets, and sticky sessions are a trap unless you understand their failure modes.
  • The fundamental trade-off: the load balancer itself is now a single point of failure that must be made highly available.

The Problem It Solves

Your startup just got featured on a tech blog. At 8:47 a.m., 50,000 people click the article simultaneously. Your app server process β€” the only one β€” pegs at 100% CPU.

The request queue fills. New connections get rejected with 503 Service Unavailable. Your on-call phone rings.

You spin up a second server. It runs fine. But all 50,000 users are still hammering the first one.

The second server sits at 0% CPU with zero traffic β€” because nobody is routing requests to it. I've seen this exact scenario play out in interviews: candidates propose horizontal scaling but forget the routing layer entirely.

That's the problem. Having more servers solves nothing if traffic has no mechanism to reach them.

The scaling blindspot

Every "scale horizontally" textbook recommendation silently assumes a load balancer already exists. Without one, adding servers doesn't reduce load on your original server at all β€” it just means you have more servers doing nothing.

flowchart TD
  subgraph Internet["🌐 Internet Layer"]
    Users(["πŸ‘€ Users\n50K concurrent\nAll traffic to same IP"])
  end

  subgraph Broken["πŸ’₯ Broken State β€” No Load Balancer"]
    Server1["βš™οΈ App Server 1\nCPU: 100% Β· Queue: Full\n503 errors Β· ~8s latency"]
    Server2["βš™οΈ App Server 2\nCPU: 0% Β· Completely Idle\nReceives zero traffic"]
  end

  Users -->|"All 50K requests β†’ same IP"| Server1
  Users -.-x|"Unreachable β€” no route to it"| Server2

The fix isn't more servers alone. The fix is a component that knows all your servers exist and can distribute traffic across all of them.


What Is It?

A load balancer is a reverse proxy that sits in front of a pool of servers and distributes incoming requests across them. It continuously monitors server health and routes traffic only to healthy instances.

Analogy: Think of an airport departure terminal with 20 check-in counters. Without a dispatcher, every passenger walks to counter 1. Counter 1 is overwhelmed; counters 2–20 are empty.

With a dispatcher at the entrance directing passengers β€” "Counter 5 has the shortest queue, go there" β€” each counter handles a proportional share and passengers clear in minutes. The dispatcher doesn't do any checking-in; their only job is directing traffic efficiently. I'll often use this analogy in interviews β€” it makes the separation of concerns immediately obvious.

flowchart TD
  subgraph Internet["🌐 Internet Layer"]
    Users(["πŸ‘€ Users\n50K concurrent requests\nSingle DNS entry β†’ VIP"])
  end

  subgraph LBTier["πŸ”€ Load Balancer Tier β€” Active/Passive HA"]
    LB["πŸ”€ Primary Load Balancer\nHealth checks Β· Algorithm routing\nSSL termination Β· Connection draining"]
    LB_Standby["πŸ”€ Standby Load Balancer\nPassive β€” promoted on primary failure\nShared VIP via VRRP / cloud HA"]
  end

  subgraph AppTier["βš™οΈ Stateless App Tier β€” Auto-Scaled"]
    AS1["βš™οΈ App Server 1\nStateless Β· Any request handled\nCPU: ~33% under even load"]
    AS2["βš™οΈ App Server 2\nStateless Β· Any request handled\nCPU: ~33% under even load"]
    AS3["βš™οΈ App Server N\nStateless Β· Auto-added on scale event\nAuto-removed on scale-in"]
  end

  subgraph SessionStore["⚑ Session Store"]
    Redis["⚑ Redis\nSessions Β· Rate limit counters\nShared by all app servers Β· < 1ms reads"]
  end

  Users -->|"HTTPS Β· DNS resolves to VIP"| LB
  LB -.->|"Failover Β· VRRP heartbeat"| LB_Standby
  LB -->|"Route Β· round-robin / least-conn"| AS1 & AS2 & AS3
  AS1 & AS2 & AS3 -->|"Session reads / writes"| Redis

The load balancer gives every server in the pool a fair share of work and hides individual server failures from users entirely. A server going down doesn't degrade the service β€” the load balancer simply stops routing to it. Stateless app servers plus a load balancer: that's the foundation every scalable system starts with.


How It Works

Here's exactly what happens when a user's request hits a load balancer:

  1. DNS resolution β€” The client resolves api.yoursite.com to a single Virtual IP (VIP) address. The VIP is owned by the load balancer, not any backend server. This decoupling is what allows backend instances to be added, removed, or replaced without any DNS change.

  2. Connection established β€” At Layer 4, the LB terminates the TCP connection from the client and opens a new one to the chosen backend. At Layer 7, it also parses the HTTP request before making a routing decision.

  3. Algorithm selects a backend β€” The load balancer runs its assignment algorithm to pick one healthy server from the pool. (Algorithms are covered in the next section.)

  4. Health check gate β€” Before routing, and continuously during operation, the LB probes each backend. Only servers that pass health checks are eligible for traffic. A server that returns errors or fails to respond within the timeout is removed from rotation automatically.

  5. Request forwarded β€” The request is proxied to the selected backend. For L7 balancers, headers are injected here: X-Forwarded-For: <client-ip>, X-Request-ID: <trace-id>.

  6. Response returned β€” The backend responds to the LB; the LB returns the response to the original client. From the client's perspective, it's talking to one server. The load balancer is completely transparent.

  7. Connection tracking β€” For long-lived connections (WebSockets, gRPC streaming), the LB pins the entire session to the same backend until the connection closes.

sequenceDiagram
    participant C as πŸ‘€ Client
    participant LB as πŸ”€ LB (Layer 7)
    participant AS as βš™οΈ App Server

    C->>LB: TCP SYN β†’ VIP:443
    Note over LB: TLS handshake β€” decrypt here
    C->>LB: HTTPS GET /api/data
    Note over LB: Parse HTTP headers<br/>Select backend: least-conn<br/>Health gate passed β€” route to AS
    LB->>AS: HTTP GET /api/data<br/>X-Forwarded-For: client-ip<br/>X-Request-ID: trace-id
    activate AS
    Note over AS: Handle request
    AS-->>LB: HTTP 200 OK + payload
    deactivate AS
    Note over LB: Log: path Β· status Β· latency
    LB-->>C: HTTPS 200 OK + payload

The LB is a transparent proxy: the client connects only to the VIP, TLS terminates at the LB, and the backend IP is never exposed to the client. The 1–3ms overhead is the cost of parsing headers and selecting a backend β€” in my experience, it's invisible at any realistic HTTP traffic volume.
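The per-request decision in steps 3–5 can be sketched in a few lines of TypeScript. This is an illustrative sketch, not any real load balancer's internals β€” the Backend shape and function names are assumptions:

```typescript
// Illustrative sketch of steps 3–5: health gate, backend selection, header injection.
interface Backend {
  host: string;
  healthy: boolean;          // maintained by the health checker
  activeConnections: number; // maintained by connection tracking
}

// Step 4: only healthy instances are eligible for traffic.
function eligible(pool: Backend[]): Backend[] {
  return pool.filter((b) => b.healthy);
}

// Step 3: least-connections selection among eligible backends.
function pickLeastConnections(pool: Backend[]): Backend {
  const healthy = eligible(pool);
  if (healthy.length === 0) throw new Error("503: no healthy backends");
  return healthy.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best
  );
}

// Step 5: headers injected before proxying to the chosen backend.
function buildUpstreamHeaders(
  clientIp: string,
  requestId: string
): Record<string, string> {
  return {
    "X-Forwarded-For": clientIp,
    "X-Request-ID": requestId,
  };
}
```

A round-robin variant would simply cycle an index over eligible(pool) instead of comparing connection counts.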

Here's what a minimal NGINX upstream config looks like in practice:

# nginx.conf β€” upstream pool with algorithm and health configuration
upstream api_servers {
  least_conn;  # Route to the instance with fewest active connections

  server app-server-1.internal:3000 weight=1 max_fails=3 fail_timeout=30s;
  server app-server-2.internal:3000 weight=1 max_fails=3 fail_timeout=30s;
  server app-server-3.internal:3000 weight=2 max_fails=3 fail_timeout=30s; # 2Γ— capacity

  keepalive 32;  # Keep up to 32 idle upstream connections warm
}

server {
  listen 443 ssl;
  server_name api.yoursite.com;

  # SSL terminates here β€” backends get plain HTTP internally
  ssl_certificate     /etc/ssl/certs/api.crt;
  ssl_certificate_key /etc/ssl/private/api.key;

  location / {
    proxy_pass http://api_servers;
    proxy_http_version 1.1;          # Required for upstream keepalive
    proxy_set_header Connection "";  # Clear hop-by-hop header so keepalive works
    proxy_set_header X-Forwarded-For $remote_addr;  # Pass real client IP
    proxy_set_header X-Request-ID    $request_id;   # For distributed tracing
    proxy_connect_timeout 5s;
    proxy_read_timeout    30s;
  }
}

Interview tip: name the health check endpoint

When you mention health checks in an interview, say what they check. "A GET /health probe every 5 seconds β€” if 3 consecutive probes fail or return non-2xx, the instance is removed from rotation." That's specific. "Health checks run" is vague and tells the interviewer nothing about your operational thinking.

Here's what the health check decision loop looks like on the LB side:

// Pseudocode β€” active health check loop per backend instance
interface BackendServer {
  host: string;
  status: "healthy" | "unhealthy";
  consecutiveFailures: number;
}

// Pool membership is managed elsewhere in the LB; declared here for completeness.
declare function addToRotation(server: BackendServer): void;
declare function removeFromRotation(server: BackendServer): void;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function healthCheckLoop(server: BackendServer): Promise<void> {
  while (true) {
    try {
      const res = await fetch(`http://${server.host}/health`, {
        signal: AbortSignal.timeout(2000), // 2s timeout
      });

      if (res.ok) {
        server.consecutiveFailures = 0;
        if (server.status === "unhealthy") {
          server.status = "healthy";
          addToRotation(server); // Re-admit after recovery
        }
      } else {
        server.consecutiveFailures++;
      }
    } catch {
      server.consecutiveFailures++; // Network error or timeout counts as a failure
    }

    // Remove from pool after 3 consecutive failures
    if (server.consecutiveFailures >= 3 && server.status === "healthy") {
      server.status = "unhealthy";
      removeFromRotation(server); // No traffic until next recovery check
    }

    await sleep(5000); // Re-probe every 5 seconds
  }
}

The state machine that drives this logic:

flowchart TD
    START(["πŸš€ Instance starts\nJoins LB pool"])
    HEALTHY["βœ… HEALTHY\nReceives full traffic share\nProbed every 5 s"]
    DEGRADED["⚠️ DEGRADED\nStill in rotation\nFail count: 1–2 of 3"]
    UNHEALTHY["❌ UNHEALTHY\nRemoved from pool\nProbed β€” no traffic"]

    START -->|"First probe: 200 OK"| HEALTHY
    HEALTHY -->|"Probe fails (1st or 2nd)"| DEGRADED
    DEGRADED -->|"Next probe passes"| HEALTHY
    DEGRADED -->|"3rd consecutive fail"| UNHEALTHY
    UNHEALTHY -->|"Probe passes β€” re-admitted"| HEALTHY

The two-step removal (DEGRADED β†’ UNHEALTHY) prevents a single flaky probe from pulling a healthy server. Recovery is immediate re-admission β€” which is why a slow-start policy (reduced weight for the first 60–90s) matters when a previously failing server comes back. Misconfigure the failure thresholds and you'll spend the next on-call shift chasing phantom outages.
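The slow-start policy mentioned above can be sketched as a simple weight ramp. The 90-second window and 10% floor are illustrative values, not a standard:

```typescript
// Illustrative slow-start: a re-admitted backend's effective weight ramps
// linearly from a 10% floor up to its full weight over a warm-up window.
function slowStartWeight(
  baseWeight: number,
  secondsSinceReadmit: number,
  warmupSeconds = 90
): number {
  const floor = 0.1; // never route zero traffic once re-admitted
  const ramp = Math.min(1, secondsSinceReadmit / warmupSeconds);
  return baseWeight * (floor + (1 - floor) * ramp);
}
```

The LB would multiply each backend's configured weight by this factor until the warm-up window ends, so a freshly recovered instance sees a trickle of traffic rather than its full share.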


Key Components

| Component | Role |
| --- | --- |
| Virtual IP (VIP) | The single IP address that DNS resolves to. Owned by the LB tier, not any backend. Allows backends to change freely without client-side DNS impact. |
| Backend Pool | The set of healthy server instances eligible to receive traffic. The LB manages membership based on health check results. |
| Health Checker | Continuously probes each backend (TCP ping, HTTP GET, or custom script). Automatically promotes or demotes backends from the pool. |
| Routing Algorithm | Selects which pool member receives the next request. Responsible for even distribution, respecting server capacity, and adapting to load skew. |
| SSL Terminator | Decrypts TLS at the LB so backends communicate over plain HTTP internally. Centralises certificate renewal and reduces per-backend CPU overhead. |
| Connection Drainer | On scale-in or rolling deploy, allows in-flight connections to complete before the backend instance is removed. Prevents mid-request drops. |
| Session Store (Redis) | Not part of the LB itself, but the external session store that makes backends stateless β€” so the LB can route any request to any instance without session loss. |
| Standby / Secondary LB | A passive LB instance that takes over via failover (VRRP, keepalived, or cloud HA) if the primary fails. Eliminates the LB as a single point of failure. |

Types / Variations

Layer 4 vs. Layer 7

The single most important classification β€” and the one I see candidates sidestep most often. It comes up in every serious system design conversation. Default to Layer 7 for HTTP workloads; drop to Layer 4 only when the question explicitly involves raw TCP or sub-millisecond routing latency.

| Feature | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Protocol | TCP / UDP | HTTP / HTTPS / gRPC / WebSocket |
| Routing basis | IP address + port only | URL path, HTTP headers, cookies, body content |
| TLS termination | Not typically | Yes β€” standard |
| Content-based routing | No | Yes β€” /api/* β†’ API servers, /static/* β†’ CDN origin |
| Performance | Faster β€” no HTTP parsing overhead | Slightly slower β€” must parse headers per request |
| Observability | Low β€” can't log HTTP status codes or paths | High β€” logs URL, status code, latency per request |
| AWS equivalent | NLB (Network Load Balancer) | ALB (Application Load Balancer) |
| Use cases | Raw TCP traffic, gaming servers, custom protocols | HTTP APIs, microservices, path-based routing |

Interview shortcut: default to L7

Unless the question explicitly involves raw TCP (multiplayer gaming, financial market data feeds, custom binary protocols), default to Layer 7. You get path-based routing, SSL termination, and per-request observability. The performance overhead is negligible for standard HTTP workloads.
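The content-based routing that distinguishes L7 reduces to a prefix match on the request path. A toy sketch β€” the pool names are hypothetical; real configs express this as nginx location blocks or ALB listener rules:

```typescript
// Toy L7 content-based routing: pick an upstream pool by URL path prefix.
function routeByPath(path: string): string {
  if (path.startsWith("/api/")) return "api_pool";          // API servers
  if (path.startsWith("/static/")) return "cdn_origin_pool"; // CDN origin
  return "default_pool";                                     // everything else
}
```

A pure L4 balancer cannot make this decision at all β€” it never sees the path, only the destination IP and port.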

Load Balancing Algorithms

| Algorithm | How It Works | Best For | Pitfall |
| --- | --- | --- | --- |
| Round Robin | Distributes requests sequentially: 1β†’2β†’3β†’1β†’2β†’3 | Homogeneous servers, stateless short-lived requests | Treats all servers equally regardless of current load or capacity |
| Weighted Round Robin | Servers get a numeric weight; proportionally more requests go to higher-weight servers | Mixed-capacity pools (different instance sizes) | Weights must be maintained manually as the pool changes |
| Least Connections | Routes each new request to the server with the fewest active connections | Long-lived connections (WebSockets, gRPC streaming) | Requires tracking per-server connection state |
| Least Response Time | Routes to the server with the lowest current average response time | Heterogeneous workloads where some requests are expensive | Requires active sampling and adds coordination overhead |
| IP Hash | hash(client_ip) % N β€” same client always hits same server | Legacy session affinity for stateful backends | Breaks when server count changes; fails behind NAT |
| Random | Picks a server at random from the healthy pool | Simple stateless APIs; eliminates coordination overhead | Can cause hot servers by statistical chance |
| Resource-Based | Routes based on CPU/memory metrics reported by each backend agent | Heterogeneous or variable-capacity workloads | Requires a metrics agent on every backend |
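Weighted round-robin can be sketched by expanding each host into the cycle proportionally to its weight, mirroring the nginx weight= parameter shown earlier. This is a naive expansion for illustration; production balancers such as nginx use a smooth variant that interleaves high-weight hosts rather than clustering them:

```typescript
// Naive weighted round-robin: each host appears in the cycle `weight` times.
// Returns a picker function that yields the next host on each call.
function weightedRoundRobin(
  servers: { host: string; weight: number }[]
): () => string {
  const cycle: string[] = [];
  for (const s of servers) {
    for (let i = 0; i < s.weight; i++) cycle.push(s.host);
  }
  let idx = 0; // grows unboundedly; fine for a demo
  return () => cycle[idx++ % cycle.length];
}
```

With weights 1 and 2 the picker yields a, b, b, a, b, b, … β€” server b receives twice the traffic.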

IP Hash breaks when servers change

IP Hash provides session affinity, but it's fragile. Adding or removing a server changes N in hash(ip) % N β€” all hash assignments shift. Existing users are suddenly routed to a different server and lose their in-memory session. This is why Redis-backed sessions are the correct solution, not IP Hash.
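The breakage is easy to demonstrate: count how many clients change servers when the pool grows from 4 to 5 under modulo hashing. toyHash is an illustrative polynomial hash, not what any real LB uses:

```typescript
// Demonstrates why hash(ip) % N is fragile when N changes.
function toyHash(s: string): number {
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

// Fraction of clients whose server assignment changes when the pool resizes.
function remappedFraction(
  clients: string[],
  serversBefore: number,
  serversAfter: number
): number {
  let moved = 0;
  for (const ip of clients) {
    if (toyHash(ip) % serversBefore !== toyHash(ip) % serversAfter) moved++;
  }
  return moved / clients.length;
}
```

For a uniform hash, growing from 4 to 5 servers remaps roughly 80% of clients (an assignment survives only when h mod 20 < 4, i.e. 4 cases in 20). Consistent hashing cuts the remapped fraction to roughly 1 in 5 β€” but stateless backends with Redis sessions sidestep the problem entirely.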

Hardware vs. Software vs. Cloud-Managed

| Type | Examples | Throughput | Ops Overhead | Cost |
| --- | --- | --- | --- | --- |
| Hardware appliance | F5 BIG-IP, Citrix ADC | Very high β€” dedicated ASIC | High β€” physical box, firmware upgrades | Very high β€” $10K–$100K+ |
| Software | NGINX, HAProxy, Envoy | High β€” software-defined, runs on commodity hardware | Medium β€” you manage config, upgrades, HA | Low β€” open source |
| Cloud-managed | AWS ALB/NLB, GCP Cloud LB, Azure LB | Scales automatically | Very low β€” provider-managed HA, scaling | Pay per request + LCU |

For anything you're building from scratch today: cloud-managed is the default. You get automatic HA, multi-AZ redundancy, and auto-scaling for fractions of a cent per LCU. I'd only reach for HAProxy or NGINX when the requirement is explicitly on-prem or you need configuration that managed offerings don't support β€” otherwise you're managing infrastructure for no reason.


Trade-offs

| Pros | Cons |
| --- | --- |
| Eliminates app-tier SPOF β€” one instance down, others continue | The LB itself is now a potential SPOF (mitigated with active/standby HA or cloud-managed) |
| Enables horizontal scaling β€” add instances and traffic automatically distributes | One additional network hop β€” typically 1–3ms latency overhead |
| Zero-downtime deployments β€” roll instances out one at a time with connection draining | SSL termination at LB means backend traffic is unencrypted internally (mitigate with end-to-end TLS or a service mesh) |
| Health checks remove failing instances within seconds β€” transparent to users | Stateful protocols (WebSockets, gRPC streaming) require connection pinning or L7 session tracking |
| SSL termination centralises certificate management for all backends | Misconfigured health checks cause false positives (healthy servers pulled) or false negatives (broken servers kept in rotation) |
| Single point for access logs, metrics, and trace ID injection | Misconfigured drain timeout causes mid-request drops during deploys |

The fundamental tension here is availability vs. complexity. A load balancer solves the single-point-of-failure problem for your app tier, but introduces itself as a new component that must be made highly available, monitored, and correctly configured.

The mistake I see most often: candidates draw the load balancer in their diagram without mentioning that it now needs HA too. Address both in the same breath β€” the interviewer will ask if you don't.


When to Use It / When to Avoid It

So when does this actually matter in an interview? Almost always β€” any system with more than one server needs one. Here's the practical guide.

Use a load balancer when:

  • You have 2+ backend instances that should share traffic.
  • You need fault tolerance β€” one instance failing must not take down the service.
  • You need zero-downtime rolling deployments (draining connections from instances one at a time).
  • You need SSL termination at a single point rather than managing certificates on every server.
  • You need path-based routing to multiple services from a single entry point.

Avoid or simplify when:

  • You're in a development environment β€” local port forwarding or a single process is sufficient.
  • You have a monolith with no traffic redundancy requirement β€” a plain reverse proxy (NGINX) is often enough.
  • You're routing internal service-to-service (east-west) traffic at high volume β€” consider a service mesh (Istio, Linkerd) rather than a centralised LB per route.
  • You're prototyping β€” get the system working first, then add the LB tier before any production deploy.

Load balancer vs. API Gateway vs. reverse proxy

These three are often conflated. A reverse proxy (NGINX serving static files) just forwards traffic to one backend. A load balancer distributes across multiple backends with health checks. An API Gateway does routing plus auth, rate limiting, and protocol transformation. In practice, products like NGINX and Envoy can do all three β€” the question is which capabilities you're actually configuring.


Real-World Examples

Google β€” Maglev

Google built a custom software load balancer called Maglev that runs on commodity servers and handles over one million packets per second per machine. Maglev uses consistent hashing over a connection table of 65,537 buckets β€” a prime number chosen for uniform distribution β€” so the same connection always reaches the same backend even when backends are added or removed.

The design handles up to 640 Gbps per cluster and sits in front of every Google service globally. Maglev is a Layer 4 LB β€” the routing decision happens before any HTTP parsing.

Netflix β€” AWS ALB + Eureka + Ribbon

Netflix uses AWS ALBs as primary Layer 7 load balancers in front of their microservice clusters. They supplement this with client-side load balancing via Eureka (service registry) and Ribbon (in-process LB library) β€” each service instance picks a backend directly, eliminating a full network hop for all internal traffic.

New instances use Ribbon's slow-start: reduced traffic weight for the first 90 seconds while their JVM JIT warms up, preventing cold instances from being overwhelmed. The slow-start pattern is worth stealing for any JVM or interpreted-language service you build.

Cloudflare β€” Geographic load balancing

Cloudflare's load balancer operates at the DNS level. When a client resolves your API hostname, Cloudflare returns the IP of the nearest healthy origin β€” informed by both geographic proximity and measured round-trip latency from multiple global vantage points.

Health checks run from multiple locations every 60 seconds. If an origin starts failing, DNS responses switch to the next healthy origin in well under a minute β€” geographic failover with zero backend changes required.


How This Shows Up in Interviews

Every system design interview with a backend tier needs a load balancer in the first diagram you draw β€” not as an afterthought after the interviewer asks "but what about availability?" Draw it immediately, name the layer, and state your algorithm choice in one sentence. The load balancer signals that you understand horizontal scaling isn't just "add more servers."

When to bring it up

Make the load balancer one of the first components you sketch for any system with multiple backend instances. Don't wait to be asked. Within 3 minutes of starting your design, a sentence like: "I'll put a Layer 7 load balancer here β€” it handles SSL termination, distributes traffic across app server instances, and removes unhealthy ones automatically" signals that you understand the fundamentals of availability.

Don't over-explain it

The load balancer is table stakes. Interviewers expect it to be there. What they want to hear is why, and what your specific choices are β€” not a description of round-robin. Open with the algorithm choice and HA setup in one sentence, then move to the more interesting design decisions.

Depth expected at senior/staff level:

  • Name the algorithm and justify it for this specific workload. WebSockets β†’ least-connections. Stateless REST β†’ round-robin. Mixed capacity β†’ weighted.
  • Proactively address the LB as a potential SPOF. Mention active/standby HA or the cloud-managed equivalent.
  • Know when to use L4 vs. L7 and what that changes about the design.
  • Understand that stateless app servers are a prerequisite for the LB to route correctly β€” not a follow-on optimisation.
  • Know what connection draining is and why it matters for zero-downtime deploys.

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
| --- | --- |
| "What if the load balancer itself goes down?" | "Active/standby via VRRP β€” the standby holds the same VIP and promotes within 5 seconds. For cloud deployments, managed LBs (ALB, NLB) are inherently multi-AZ; the provider handles HA." |
| "How do you handle WebSockets?" | "Layer 7, least-connections algorithm. The LB pins the entire WebSocket session to one backend for its lifetime. Round-robin distributes connections evenly at connect time but ignores how long they live β€” the pool skews toward servers holding many long-lived sockets." |
| "Why not use IP Hash for session affinity?" | "IP Hash breaks when server count changes β€” all hash assignments shift. It also fails behind carrier NAT where many clients share one IP. The correct solution is stateless backends with session state in Redis." |
| "How do you do a zero-downtime deploy?" | "Connection draining: stop new requests to the instance being updated, wait for in-flight requests to complete (30s drain timeout), then replace the instance. Users never hit a server mid-deploy." |
| "L4 or L7 β€” which would you pick?" | "L7 almost always β€” I get URL-based routing, SSL termination, and per-request observability. L4 only if the protocol is raw TCP or ultra-low-latency requirements make HTTP parsing overhead unacceptable." |


Quick Recap

  1. A load balancer distributes incoming traffic across a pool of healthy backend servers, eliminating the single-server bottleneck and making horizontal scaling possible. Without one, adding servers doesn't reduce load on the original.
  2. Layer 4 load balancers route by TCP/IP address and port β€” fast, protocol-agnostic. Layer 7 load balancers route by HTTP content (URL, headers, cookies) and terminate SSL β€” smarter, with per-request observability. Default to L7 for HTTP workloads.
  3. Algorithm choice drives real outcomes: round-robin for homogeneous stateless services; least-connections for long-lived sockets (WebSockets, gRPC streaming); weighted round-robin for mixed-capacity pools.
  4. Stateless app servers β€” sessions in Redis, not in process memory β€” are the prerequisite for the LB to route correctly. IP Hash (sticky sessions) breaks when server count changes and causes mass session loss on instance failure.
  5. Active health checks (GET /health) detect dead servers, not degraded ones. Passive outlier detection is needed to catch slow or error-prone instances before they saturate and cause visible failures.
  6. The load balancer itself must be made highly available: active/standby with a shared Virtual IP, or a cloud-managed LB that is inherently multi-AZ. A single LB node is just a new single point of failure.
  7. In interviews, name the algorithm and justify it, proactively address LB HA, explain stateless design as a prerequisite, and describe connection draining for zero-downtime deploys β€” these four together signal staff-level depth.

Related Concepts

  • Scalability β€” Load balancing is the mechanism that makes horizontal scaling of the app tier possible. A stateless app tier behind a load balancer is the core pattern for handling 10Γ— traffic spikes.
  • API Gateway β€” An API Gateway includes a load balancer but adds routing, auth, rate limiting, and protocol translation. Know when a gateway adds value vs. when a bare load balancer is sufficient.
  • Caching β€” Even a perfectly load-balanced app tier gets overwhelmed if every instance makes separate database reads. Caching in Redis keeps DB load flat as the number of instances grows.
  • Rate Limiting β€” Rate limiters live at the load balancer or immediately behind it. Redis-backed distributed rate limiting prevents any single client from overwhelming the backend pool regardless of which instance handles their requests.
  • Service Mesh β€” A service mesh handles load balancing for east-west (service-to-service) traffic, with retries, circuit breaking, and mTLS baked in. A front-end load balancer handles north-south (client-to-service) traffic. You typically need both.
  • Envoy β€” Modern, microservices-focused

In System Design Interviews

When discussing load balancing in interviews:

  1. Place load balancers between every critical tier (client β†’ web server, web server β†’ app, app β†’ db)
  2. Mention redundant load balancers (active-passive) to avoid SPOF
  3. Discuss which algorithm and why
  4. Consider geographic load balancing (DNS-based) for global systems
