πŸ“HowToHLD
Vote for New Content
Vote for New Content
Home/High Level Design/Concepts

Load balancing

Learn how load balancers distribute traffic across servers, which algorithms to choose, and how to design a highly available app tier in any system design interview.

40 min read Β· 2026-03-23 Β· easy Β· Tags: load-balancing, hld, concepts, scalability, availability

TL;DR

  • A load balancer sits in front of your server pool and distributes incoming requests so no single instance bears all the traffic.
  • It's what makes horizontal scaling work: adding servers is useless unless something routes traffic to them.
  • Layer 4 (TCP/UDP) load balancers are faster; Layer 7 (HTTP) load balancers are smarter β€” they route by URL, header, or cookie, and terminate SSL.
  • The algorithm matters: round-robin is the default, least-connections wins for long-lived sockets, and sticky sessions are a trap unless you understand their failure modes.
  • The fundamental trade-off: the load balancer itself is now a single point of failure that must be made highly available.

The Problem It Solves

Your startup just got featured on a tech blog. At 8:47 a.m., 50,000 people click the article simultaneously. Your app server process β€” the only one β€” pegs at 100% CPU.

The request queue fills. New connections get rejected with 503 Service Unavailable. Your on-call phone rings.

You spin up a second server. It runs fine. But all 50,000 users are still hammering the first one.

The second server sits at 0% CPU with zero traffic β€” because nobody is routing requests to it. I've seen this exact scenario play out in interviews: candidates propose horizontal scaling but forget the routing layer entirely.

That's the problem. Having more servers solves nothing if traffic has no mechanism to reach them.

The scaling blindspot

Every "scale horizontally" textbook recommendation silently assumes a load balancer already exists. Without one, adding servers doesn't reduce load on your original server at all β€” it just means you have more servers doing nothing.

flowchart TD
  subgraph Internet["🌐 Internet Layer"]
    Users(["πŸ‘€ Users\n50K concurrent\nAll traffic to same IP"])
  end

  subgraph Broken["πŸ’₯ Broken State β€” No Load Balancer"]
    Server1["βš™οΈ App Server 1\nCPU: 100% Β· Queue: Full\n503 errors Β· ~8s latency"]
    Server2["βš™οΈ App Server 2\nCPU: 0% Β· Completely Idle\nReceives zero traffic"]
  end

  Users -->|"All 50K requests β†’ same IP"| Server1
  Users -.-x|"Unreachable β€” no route to it"| Server2

The fix isn't more servers alone. The fix is a component that knows all your servers exist and can distribute traffic across all of them.


What Is It?

A load balancer is a reverse proxy that sits in front of a pool of servers and distributes incoming requests across them. It continuously monitors server health and routes traffic only to healthy instances.

Analogy: Think of an airport departure terminal with 20 check-in counters. Without a dispatcher, every passenger walks to counter 1. Counter 1 is overwhelmed; counters 2–20 are empty.

With a dispatcher at the entrance directing passengers β€” "Counter 5 has the shortest queue, go there" β€” each counter handles a proportional share and passengers clear in minutes. The dispatcher doesn't do any checking-in; their only job is directing traffic efficiently. I'll often use this analogy in interviews β€” it makes the separation of concerns immediately obvious.

flowchart TD
  subgraph Internet["🌐 Internet Layer"]
    Users(["πŸ‘€ Users\n50K concurrent requests\nSingle DNS entry β†’ VIP"])
  end

  subgraph LBTier["πŸ”€ Load Balancer Tier β€” Active/Passive HA"]
    LB["πŸ”€ Primary Load Balancer\nHealth checks Β· Algorithm routing\nSSL termination Β· Connection draining"]
    LB_Standby["πŸ”€ Standby Load Balancer\nPassive β€” promoted on primary failure\nShared VIP via VRRP / cloud HA"]
  end

  subgraph AppTier["βš™οΈ Stateless App Tier β€” Auto-Scaled"]
    AS1["βš™οΈ App Server 1\nStateless Β· Any request handled\nCPU: ~33% under even load"]
    AS2["βš™οΈ App Server 2\nStateless Β· Any request handled\nCPU: ~33% under even load"]
    AS3["βš™οΈ App Server N\nStateless Β· Auto-added on scale event\nAuto-removed on scale-in"]
  end

  subgraph SessionStore["⚑ Session Store"]
    Redis["⚑ Redis\nSessions Β· Rate limit counters\nShared by all app servers Β· < 1ms reads"]
  end

  Users -->|"HTTPS Β· DNS resolves to VIP"| LB
  LB -.->|"Failover Β· VRRP heartbeat"| LB_Standby
  LB -->|"Route Β· round-robin / least-conn"| AS1 & AS2 & AS3
  AS1 & AS2 & AS3 -->|"Session reads / writes"| Redis

The load balancer gives every server in the pool a fair share of work and hides individual server failures from users entirely. A server going down doesn't degrade the service β€” the load balancer simply stops routing to it. Stateless app servers plus a load balancer: that's the foundation every scalable system starts with.


How It Works

Here's exactly what happens when a user's request hits a load balancer:

  1. DNS resolution β€” The client resolves api.yoursite.com to a single Virtual IP (VIP) address. The VIP is owned by the load balancer, not any backend server. This decoupling is what allows backend instances to be added, removed, or replaced without any DNS change.

  2. Connection established β€” At Layer 4, the LB terminates the TCP connection from the client and opens a new one to the chosen backend. At Layer 7, it also parses the HTTP request before making a routing decision.

  3. Algorithm selects a backend β€” The load balancer runs its assignment algorithm to pick one healthy server from the pool. (Algorithms are covered in the next section.)

  4. Health check gate β€” Before routing, and continuously during operation, the LB probes each backend. Only servers that pass health checks are eligible for traffic. A server that returns errors or fails to respond within the timeout is removed from rotation automatically.

  5. Request forwarded β€” The request is proxied to the selected backend. For L7 balancers, headers are injected here: X-Forwarded-For: <client-ip>, X-Request-ID: <trace-id>.

  6. Response returned β€” The backend responds to the LB; the LB returns the response to the original client. From the client's perspective, it's talking to one server. The load balancer is completely transparent.

  7. Connection tracking β€” For long-lived connections (WebSockets, gRPC streaming), the LB pins the entire session to the same backend until the connection closes.

sequenceDiagram
    participant C as πŸ‘€ Client
    participant LB as πŸ”€ LB (Layer 7)
    participant AS as βš™οΈ App Server

    C->>LB: TCP SYN β†’ VIP:443
    Note over LB: TLS handshake β€” decrypt here
    C->>LB: HTTPS GET /api/data
    Note over LB: Parse HTTP headers<br/>Select backend: least-conn<br/>Health gate passed β€” route to AS
    LB->>AS: HTTP GET /api/data<br/>X-Forwarded-For: client-ip<br/>X-Request-ID: trace-id
    activate AS
    Note over AS: Handle request
    AS-->>LB: HTTP 200 OK + payload
    deactivate AS
    Note over LB: Log: path Β· status Β· latency
    LB-->>C: HTTPS 200 OK + payload

The LB is a transparent proxy: the client connects only to the VIP, TLS terminates at the LB, and the backend IP is never exposed to the client. The 1–3ms overhead is the cost of parsing headers and selecting a backend β€” in my experience, it's invisible at any realistic HTTP traffic volume.
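The per-request decision in steps 3–5 can be sketched in a few lines of TypeScript. This is an illustrative sketch, not any real load balancer's internals β€” the Backend shape and function names are assumptions:

```typescript
// Illustrative sketch of steps 3–5: health gate, backend selection, header injection.
interface Backend {
  host: string;
  healthy: boolean;          // maintained by the health checker
  activeConnections: number; // maintained by connection tracking
}

// Step 4: only healthy instances are eligible for traffic.
function eligible(pool: Backend[]): Backend[] {
  return pool.filter((b) => b.healthy);
}

// Step 3: least-connections selection among eligible backends.
function pickLeastConnections(pool: Backend[]): Backend {
  const healthy = eligible(pool);
  if (healthy.length === 0) throw new Error("503: no healthy backends");
  return healthy.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best
  );
}

// Step 5: headers injected before proxying to the chosen backend.
function buildUpstreamHeaders(
  clientIp: string,
  requestId: string
): Record<string, string> {
  return {
    "X-Forwarded-For": clientIp,
    "X-Request-ID": requestId,
  };
}
```

A round-robin variant would simply cycle an index over eligible(pool) instead of comparing connection counts.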

Here's what a minimal NGINX upstream config looks like in practice:

# nginx.conf β€” upstream pool with algorithm and health configuration
upstream api_servers {
  least_conn;  # Route to the instance with fewest active connections

  server app-server-1.internal:3000 weight=1 max_fails=3 fail_timeout=30s;
  server app-server-2.internal:3000 weight=1 max_fails=3 fail_timeout=30s;
  server app-server-3.internal:3000 weight=2 max_fails=3 fail_timeout=30s; # 2Γ— capacity

  keepalive 32;  # Keep up to 32 idle upstream connections warm
}

server {
  listen 443 ssl;
  server_name api.yoursite.com;

  # SSL terminates here β€” backends get plain HTTP internally
  ssl_certificate     /etc/ssl/certs/api.crt;
  ssl_certificate_key /etc/ssl/private/api.key;

  location / {
    proxy_pass http://api_servers;
    proxy_http_version 1.1;          # Required for upstream keepalive
    proxy_set_header Connection "";  # Clear hop-by-hop header so keepalive works
    proxy_set_header X-Forwarded-For $remote_addr;  # Pass real client IP
    proxy_set_header X-Request-ID    $request_id;   # For distributed tracing
    proxy_connect_timeout 5s;
    proxy_read_timeout    30s;
  }
}

Interview tip: name the health check endpoint

When you mention health checks in an interview, say what they check. "A GET /health probe every 5 seconds β€” if 3 consecutive probes fail or return non-2xx, the instance is removed from rotation." That's specific. "Health checks run" is vague and tells the interviewer nothing about your operational thinking.

Here's what the health check decision loop looks like on the LB side:

// Pseudocode β€” active health check loop per backend instance
interface BackendServer {
  host: string;
  status: "healthy" | "unhealthy";
  consecutiveFailures: number;
}

// Pool membership is managed elsewhere in the LB; declared here for completeness.
declare function addToRotation(server: BackendServer): void;
declare function removeFromRotation(server: BackendServer): void;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function healthCheckLoop(server: BackendServer): Promise<void> {
  while (true) {
    try {
      const res = await fetch(`http://${server.host}/health`, {
        signal: AbortSignal.timeout(2000), // 2s timeout
      });

      if (res.ok) {
        server.consecutiveFailures = 0;
        if (server.status === "unhealthy") {
          server.status = "healthy";
          addToRotation(server); // Re-admit after recovery
        }
      } else {
        server.consecutiveFailures++;
      }
    } catch {
      server.consecutiveFailures++; // Network error or timeout counts as a failure
    }

    // Remove from pool after 3 consecutive failures
    if (server.consecutiveFailures >= 3 && server.status === "healthy") {
      server.status = "unhealthy";
      removeFromRotation(server); // No traffic until next recovery check
    }

    await sleep(5000); // Re-probe every 5 seconds
  }
}

The state machine that drives this logic:

flowchart TD
    START(["πŸš€ Instance starts\nJoins LB pool"])
    HEALTHY["βœ… HEALTHY\nReceives full traffic share\nProbed every 5 s"]
    DEGRADED["⚠️ DEGRADED\nStill in rotation\nFail count: 1–2 of 3"]
    UNHEALTHY["❌ UNHEALTHY\nRemoved from pool\nProbed β€” no traffic"]

    START -->|"First probe: 200 OK"| HEALTHY
    HEALTHY -->|"Probe fails (1st or 2nd)"| DEGRADED
    DEGRADED -->|"Next probe passes"| HEALTHY
    DEGRADED -->|"3rd consecutive fail"| UNHEALTHY
    UNHEALTHY -->|"Probe passes β€” re-admitted"| HEALTHY

The two-step removal (DEGRADED β†’ UNHEALTHY) prevents a single flaky probe from pulling a healthy server. Recovery is immediate re-admission β€” which is why a slow-start policy (reduced weight for the first 60–90s) matters when a previously failing server comes back. Misconfigure the failure thresholds and you'll spend the next on-call shift chasing phantom outages.
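The slow-start policy mentioned above can be sketched as a simple weight ramp. The 90-second window and 10% floor are illustrative values, not a standard:

```typescript
// Illustrative slow-start: a re-admitted backend's effective weight ramps
// linearly from a 10% floor up to its full weight over a warm-up window.
function slowStartWeight(
  baseWeight: number,
  secondsSinceReadmit: number,
  warmupSeconds = 90
): number {
  const floor = 0.1; // never route zero traffic once re-admitted
  const ramp = Math.min(1, secondsSinceReadmit / warmupSeconds);
  return baseWeight * (floor + (1 - floor) * ramp);
}
```

The LB would multiply each backend's configured weight by this factor until the warm-up window ends, so a freshly recovered instance sees a trickle of traffic rather than its full share.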


Key Components

| Component | Role |
| --- | --- |
| Virtual IP (VIP) | The single IP address that DNS resolves to. Owned by the LB tier, not any backend. Allows backends to change freely without client-side DNS impact. |
| Backend Pool | The set of healthy server instances eligible to receive traffic. The LB manages membership based on health check results. |
| Health Checker | Continuously probes each backend (TCP ping, HTTP GET, or custom script). Automatically promotes or demotes backends from the pool. |
| Routing Algorithm | Selects which pool member receives the next request. Responsible for even distribution, respecting server capacity, and adapting to load skew. |
| SSL Terminator | Decrypts TLS at the LB so backends communicate over plain HTTP internally. Centralises certificate renewal and reduces per-backend CPU overhead. |
| Connection Drainer | On scale-in or rolling deploy, allows in-flight connections to complete before the backend instance is removed. Prevents mid-request drops. |
| Session Store (Redis) | Not part of the LB itself, but the external session store that makes backends stateless β€” so the LB can route any request to any instance without session loss. |
| Standby / Secondary LB | A passive LB instance that takes over via failover (VRRP, keepalived, or cloud HA) if the primary fails. Eliminates the LB as a single point of failure. |

Types / Variations

Layer 4 vs. Layer 7

The single most important classification β€” and the one I see candidates sidestep most often. It comes up in every serious system design conversation. Default to Layer 7 for HTTP workloads; drop to Layer 4 only when the question explicitly involves raw TCP or sub-millisecond routing latency.

| Feature | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Protocol | TCP / UDP | HTTP / HTTPS / gRPC / WebSocket |
| Routing basis | IP address + port only | URL path, HTTP headers, cookies, body content |
| TLS termination | Not typically | Yes β€” standard |
| Content-based routing | No | Yes β€” /api/* β†’ API servers, /static/* β†’ CDN origin |
| Performance | Faster β€” no HTTP parsing overhead | Slightly slower β€” must parse headers per request |
| Observability | Low β€” can't log HTTP status codes or paths | High β€” logs URL, status code, latency per request |
| AWS equivalent | NLB (Network Load Balancer) | ALB (Application Load Balancer) |
| Use cases | Raw TCP traffic, gaming servers, custom protocols | HTTP APIs, microservices, path-based routing |

Interview shortcut: default to L7

Unless the question explicitly involves raw TCP (multiplayer gaming, financial market data feeds, custom binary protocols), default to Layer 7. You get path-based routing, SSL termination, and per-request observability. The performance overhead is negligible for standard HTTP workloads.
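The content-based routing that distinguishes L7 reduces to a prefix match on the request path. A toy sketch β€” the pool names are hypothetical; real configs express this as nginx location blocks or ALB listener rules:

```typescript
// Toy L7 content-based routing: pick an upstream pool by URL path prefix.
function routeByPath(path: string): string {
  if (path.startsWith("/api/")) return "api_pool";          // API servers
  if (path.startsWith("/static/")) return "cdn_origin_pool"; // CDN origin
  return "default_pool";                                     // everything else
}
```

A pure L4 balancer cannot make this decision at all β€” it never sees the path, only the destination IP and port.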

Load Balancing Algorithms

| Algorithm | How It Works | Best For | Pitfall |
| --- | --- | --- | --- |
| Round Robin | Distributes requests sequentially: 1β†’2β†’3β†’1β†’2β†’3 | Homogeneous servers, stateless short-lived requests | Treats all servers equally regardless of current load or capacity |
| Weighted Round Robin | Servers get a numeric weight; proportionally more requests go to higher-weight servers | Mixed-capacity pools (different instance sizes) | Weights must be maintained manually as the pool changes |
| Least Connections | Routes each new request to the server with the fewest active connections | Long-lived connections (WebSockets, gRPC streaming) | Requires tracking per-server connection state |
| Least Response Time | Routes to the server with the lowest current average response time | Heterogeneous workloads where some requests are expensive | Requires active sampling and adds coordination overhead |
| IP Hash | hash(client_ip) % N β€” same client always hits same server | Legacy session affinity for stateful backends | Breaks when server count changes; fails behind NAT |
| Random | Picks a server at random from the healthy pool | Simple stateless APIs; eliminates coordination overhead | Can cause hot servers by statistical chance |
| Resource-Based | Routes based on CPU/memory metrics reported by each backend agent | Heterogeneous or variable-capacity workloads | Requires a metrics agent on every backend |
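Weighted round-robin can be sketched by expanding each host into the cycle proportionally to its weight, mirroring the nginx weight= parameter shown earlier. This is a naive expansion for illustration; production balancers such as nginx use a smooth variant that interleaves high-weight hosts rather than clustering them:

```typescript
// Naive weighted round-robin: each host appears in the cycle `weight` times.
// Returns a picker function that yields the next host on each call.
function weightedRoundRobin(
  servers: { host: string; weight: number }[]
): () => string {
  const cycle: string[] = [];
  for (const s of servers) {
    for (let i = 0; i < s.weight; i++) cycle.push(s.host);
  }
  let idx = 0; // grows unboundedly; fine for a demo
  return () => cycle[idx++ % cycle.length];
}
```

With weights 1 and 2 the picker yields a, b, b, a, b, b, … β€” server b receives twice the traffic.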

IP Hash breaks when servers change

IP Hash provides session affinity, but it's fragile. Adding or removing a server changes N in hash(ip) % N β€” all hash assignments shift. Existing users are suddenly routed to a different server and lose their in-memory session. This is why Redis-backed sessions are the correct solution, not IP Hash.
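The breakage is easy to demonstrate: count how many clients change servers when the pool grows from 4 to 5 under modulo hashing. toyHash is an illustrative polynomial hash, not what any real LB uses:

```typescript
// Demonstrates why hash(ip) % N is fragile when N changes.
function toyHash(s: string): number {
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

// Fraction of clients whose server assignment changes when the pool resizes.
function remappedFraction(
  clients: string[],
  serversBefore: number,
  serversAfter: number
): number {
  let moved = 0;
  for (const ip of clients) {
    if (toyHash(ip) % serversBefore !== toyHash(ip) % serversAfter) moved++;
  }
  return moved / clients.length;
}
```

For a uniform hash, growing from 4 to 5 servers remaps roughly 80% of clients (an assignment survives only when h mod 20 < 4, i.e. 4 cases in 20). Consistent hashing cuts the remapped fraction to roughly 1 in 5 β€” but stateless backends with Redis sessions sidestep the problem entirely.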

Hardware vs. Software vs. Cloud-Managed

| Type | Examples | Throughput | Ops Overhead | Cost |
| --- | --- | --- | --- | --- |
| Hardware appliance | F5 BIG-IP, Citrix ADC | Very high β€” dedicated ASIC | High β€” physical box, firmware upgrades | Very high β€” $10K–$100K+ |
| Software | NGINX, HAProxy, Envoy | High β€” software-defined, runs on commodity hardware | Medium β€” you manage config, upgrades, HA | Low β€” open source |
| Cloud-managed | AWS ALB/NLB, GCP Cloud LB, Azure LB | Scales automatically | Very low β€” provider-managed HA, scaling | Pay per request + LCU |

For anything you're building from scratch today: cloud-managed is the default. You get automatic HA, multi-AZ redundancy, and auto-scaling for fractions of a cent per LCU. I'd only reach for HAProxy or NGINX when the requirement is explicitly on-prem or you need configuration that managed offerings don't support β€” otherwise you're managing infrastructure for no reason.


Trade-offs

| Pros | Cons |
| --- | --- |
| Eliminates app-tier SPOF β€” one instance down, others continue | The LB itself is now a potential SPOF (mitigated with active/standby HA or cloud-managed) |
| Enables horizontal scaling β€” add instances and traffic automatically distributes | One additional network hop β€” typically 1–3ms latency overhead |
| Zero-downtime deployments β€” roll instances out one at a time with connection draining | SSL termination at LB means backend traffic is unencrypted internally (mitigate with end-to-end TLS or a service mesh) |
| Health checks remove failing instances within seconds β€” transparent to users | Stateful protocols (WebSockets, gRPC streaming) require connection pinning or L7 session tracking |
| SSL termination centralises certificate management for all backends | Misconfigured health checks cause false positives (healthy servers pulled) or false negatives (broken servers kept in rotation) |
| Single point for access logs, metrics, and trace ID injection | Misconfigured drain timeout causes mid-request drops during deploys |

The fundamental tension here is availability vs. complexity. A load balancer solves the single-point-of-failure problem for your app tier, but introduces itself as a new component that must be made highly available, monitored, and correctly configured.

The mistake I see most often: candidates draw the load balancer in their diagram without mentioning that it now needs HA too. Address both in the same breath β€” the interviewer will ask if you don't.


When to Use It / When to Avoid It

So when does this actually matter in an interview? Almost always β€” any system with more than one server needs one. Here's the practical guide.

Use a load balancer when:

  • You have 2+ backend instances that should share traffic.
  • You need fault tolerance β€” one instance failing must not take down the service.
  • You need zero-downtime rolling deployments (draining connections from instances one at a time).
  • You need SSL termination at a single point rather than managing certificates on every server.
  • You need path-based routing to multiple services from a single entry point.

Avoid or simplify when:

  • You're in a development environment β€” local port forwarding or a single process is sufficient.
  • You have a monolith with no traffic redundancy requirement β€” a plain reverse proxy (NGINX) is often enough.
  • You're routing internal service-to-service (east-west) traffic at high volume β€” consider a service mesh (Istio, Linkerd) rather than a centralised LB per route.
  • You're prototyping β€” get the system working first, then add the LB tier before any production deploy.

Load balancer vs. API Gateway vs. reverse proxy

These three are often conflated. A reverse proxy (NGINX serving static files) just forwards traffic to one backend. A load balancer distributes across multiple backends with health checks. An API Gateway does routing plus auth, rate limiting, and protocol transformation. In practice, products like NGINX and Envoy can do all three β€” the question is which capabilities you're actually configuring.


Real-World Examples

Google β€” Maglev

Google built a custom software load balancer called Maglev that runs on commodity servers and handles over one million packets per second per machine. Maglev uses consistent hashing over a connection table of 65,537 buckets β€” a prime number chosen for uniform distribution β€” so the same connection always reaches the same backend even when backends are added or removed.

The design handles up to 640 Gbps per cluster and sits in front of every Google service globally. Maglev is a Layer 4 LB β€” the routing decision happens before any HTTP parsing.

Netflix β€” AWS ALB + Eureka + Ribbon

Netflix uses AWS ALBs as primary Layer 7 load balancers in front of their microservice clusters. They supplement this with client-side load balancing via Eureka (service registry) and Ribbon (in-process LB library) β€” each service instance picks a backend directly, eliminating a full network hop for all internal traffic.

New instances use Ribbon's slow-start: reduced traffic weight for the first 90 seconds while their JVM JIT warms up, preventing cold instances from being overwhelmed. The slow-start pattern is worth stealing for any JVM or interpreted-language service you build.

Cloudflare β€” Geographic load balancing

Cloudflare's load balancer operates at the DNS level. When a client resolves your API hostname, Cloudflare returns the IP of the nearest healthy origin β€” informed by both geographic proximity and measured round-trip latency from multiple global vantage points.

Health checks run from multiple locations every 60 seconds. If an origin starts failing, DNS responses switch to the next healthy origin in well under a minute β€” geographic failover with zero backend changes required.


How This Shows Up in Interviews

Every system design interview with a backend tier needs a load balancer in the first diagram you draw β€” not as an afterthought after the interviewer asks "but what about availability?" Draw it immediately, name the layer, and state your algorithm choice in one sentence. The load balancer signals that you understand horizontal scaling isn't just "add more servers."

When to bring it up

Make the load balancer one of the first components you sketch for any system with multiple backend instances. Don't wait to be asked. Within 3 minutes of starting your design, a sentence like: "I'll put a Layer 7 load balancer here β€” it handles SSL termination, distributes traffic across app server instances, and removes unhealthy ones automatically" signals that you understand the fundamentals of availability.

Don't over-explain it

The load balancer is table stakes. Interviewers expect it to be there. What they want to hear is why, and what your specific choices are β€” not a description of round-robin. Open with the algorithm choice and HA setup in one sentence, then move to the more interesting design decisions.

Depth expected at senior/staff level:

  • Name the algorithm and justify it for this specific workload. WebSockets β†’ least-connections. Stateless REST β†’ round-robin. Mixed capacity β†’ weighted.
  • Proactively address the LB as a potential SPOF. Mention active/standby HA or the cloud-managed equivalent.
  • Know when to use L4 vs. L7 and what that changes about the design.
  • Understand that stateless app servers are a prerequisite for the LB to route correctly β€” not a follow-on optimisation.
  • Know what connection draining is and why it matters for zero-downtime deploys.

Common follow-up questions and strong answers:

| Interviewer asks | Strong answer |
| --- | --- |
| "What if the load balancer itself goes down?" | "Active/standby via VRRP β€” the standby holds the same VIP and promotes within 5 seconds. For cloud deployments, managed LBs (ALB, NLB) are inherently multi-AZ; the provider handles HA." |
| "How do you handle WebSockets?" | "Layer 7, least-connections algorithm. The LB pins the entire WebSocket session to one backend for its lifetime. Round-robin distributes connections evenly at connect time but ignores how long they live β€” the pool skews toward servers holding many long-lived sockets." |
| "Why not use IP Hash for session affinity?" | "IP Hash breaks when server count changes β€” all hash assignments shift. It also fails behind carrier NAT where many clients share one IP. The correct solution is stateless backends with session state in Redis." |
| "How do you do a zero-downtime deploy?" | "Connection draining: stop new requests to the instance being updated, wait for in-flight requests to complete (30s drain timeout), then replace the instance. Users never hit a server mid-deploy." |
| "L4 or L7 β€” which would you pick?" | "L7 almost always β€” I get URL-based routing, SSL termination, and per-request observability. L4 only if the protocol is raw TCP or ultra-low-latency requirements make HTTP parsing overhead unacceptable." |


Quick Recap

  1. A load balancer distributes incoming traffic across a pool of healthy backend servers, eliminating the single-server bottleneck and making horizontal scaling possible. Without one, adding servers doesn't reduce load on the original.
  2. Layer 4 load balancers route by TCP/IP address and port β€” fast, protocol-agnostic. Layer 7 load balancers route by HTTP content (URL, headers, cookies) and terminate SSL β€” smarter, with per-request observability. Default to L7 for HTTP workloads.
  3. Algorithm choice drives real outcomes: round-robin for homogeneous stateless services; least-connections for long-lived sockets (WebSockets, gRPC streaming); weighted round-robin for mixed-capacity pools.
  4. Stateless app servers β€” sessions in Redis, not in process memory β€” are the prerequisite for the LB to route correctly. IP Hash (sticky sessions) breaks when server count changes and causes mass session loss on instance failure.
  5. Active health checks (GET /health) detect dead servers, not degraded ones. Passive outlier detection is needed to catch slow or error-prone instances before they saturate and cause visible failures.
  6. The load balancer itself must be made highly available: active/standby with a shared Virtual IP, or a cloud-managed LB that is inherently multi-AZ. A single LB node is just a new single point of failure.
  7. In interviews, name the algorithm and justify it, proactively address LB HA, explain stateless design as a prerequisite, and describe connection draining for zero-downtime deploys β€” these four together signal staff-level depth.

Related Concepts

  • Scalability β€” Load balancing is the mechanism that makes horizontal scaling of the app tier possible. A stateless app tier behind a load balancer is the core pattern for handling 10Γ— traffic spikes.
  • API Gateway β€” An API Gateway includes a load balancer but adds routing, auth, rate limiting, and protocol translation. Know when a gateway adds value vs. when a bare load balancer is sufficient.
  • Caching β€” Even a perfectly load-balanced app tier gets overwhelmed if every instance makes separate database reads. Caching in Redis keeps DB load flat as the number of instances grows.
  • Rate Limiting β€” Rate limiters live at the load balancer or immediately behind it. Redis-backed distributed rate limiting prevents any single client from overwhelming the backend pool regardless of which instance handles their requests.
  • Service Mesh β€” A service mesh handles load balancing for east-west (service-to-service) traffic, with retries, circuit breaking, and mTLS baked in. A front-end load balancer handles north-south (client-to-service) traffic. You typically need both.
  • Envoy β€” Modern, microservices-focused

In System Design Interviews

When discussing load balancing in interviews:

  1. Place load balancers between every critical tier (client β†’ web server, web server β†’ app, app β†’ db)
  2. Mention redundant load balancers (active-passive) to avoid SPOF
  3. Discuss which algorithm and why
  4. Consider geographic load balancing (DNS-based) for global systems
