Blue-green deployment
How blue-green deployment eliminates downtime by maintaining two identical production environments and swapping traffic instantly, with rollback guarantees and database migration strategies.
TL;DR
- Blue-green keeps two identical production environments: blue (live) and green (idle). You deploy to the idle one, test it, then flip the load balancer.
- The traffic switch is atomic, not progressive. All users move from v1 to v2 in a single step, providing zero-downtime deployments with no mixed-version window.
- Rollback is instant: flip the load balancer back to blue. No redeployment, no waiting, no downtime.
- The hardest part is database migrations. Both environments share one database, so schema changes must be backward-compatible (expand-and-contract pattern).
- Blue-green costs roughly 2x infrastructure during the deployment window, which is why many teams move to canary deployments once they have the observability to support progressive rollouts.
- Prefer load balancer switching over DNS-based switching. DNS TTL caching makes rollbacks unpredictable.
The Problem
Your team deploys every Thursday at 2 PM. The process: pull the service from the load balancer, deploy new code, restart, run smoke tests, add it back. During the restart window (sometimes 30 seconds, sometimes 5 minutes), the service is down. Customers see 502 errors. The support team braces for the weekly deploy.
Sometimes the deploy fails entirely. The new version crashes on startup, or a configuration is wrong, or a missing environment variable takes down the service. Now you're scrambling to redeploy the old version while customers wait.
Rolling deployments improve the situation but introduce a different problem. For 10 to 15 minutes, some instances run v1 and others run v2. If v2 changes an API response format, v1 clients may break. If v2 modifies session structures, users who bounce between instances get corrupted sessions.
What you really want is a deployment where the switch is instant, all-or-nothing, and reversible. No mixed-version window. No downtime window. You deploy the new version somewhere safe, verify it works, then flip a switch. If it breaks, flip the switch back.
The insight behind blue-green: you don't deploy to the running environment. You deploy to a completely separate, identical environment that nobody is using. You test it in isolation. Then you move users there all at once.
One-Line Definition
Blue-green deployment runs two identical environments (blue = live, green = idle) and deploys new code to the idle one, then atomically switches all traffic from blue to green via a load balancer or DNS change.
Analogy
Think of a theater with two identical stages, stage left and stage right. The audience (users) only faces one stage at a time. While the live show runs on stage left, the crew sets up the next act on stage right: building sets, testing lights, running a full dress rehearsal. When everything is ready, the turntable rotates and the audience instantly sees the new act. If a prop breaks or an actor forgets their line, you rotate back in seconds. No intermission, no awkward scene change. The crew on stage left is still there, ready to perform again.
Solution Walkthrough
The deployment lifecycle
Blue-green has five distinct phases. The key insight is that the new version receives zero user traffic until you're confident it works. Every other deployment strategy (rolling, canary) exposes some users to the new version before you know it works at full scale.
Phase 2 is what separates blue-green from canary. You're testing with synthetic or internal traffic, not real users. This gives you a pre-production validation step in a production-identical environment.
Traffic switching mechanisms
The switch can happen at different layers, each with different speed and flexibility:
| Mechanism | Switch Speed | Rollback Speed | Complexity |
|---|---|---|---|
| Load balancer (ALB/Nginx) | Instant (seconds) | Instant | Low |
| DNS (Route 53 weighted) | Minutes (TTL-dependent) | Minutes | Medium |
| Kubernetes Service selector | Seconds | Seconds | Low (if already on K8s) |
| API Gateway routing | Instant | Instant | Medium |
My recommendation: use the load balancer approach unless you have a specific reason for DNS. DNS TTL caching makes rollback slower and less predictable. Some clients cache DNS for much longer than the TTL specifies.
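To make the load balancer switch concrete, here is a minimal sketch that models it as a single pointer flip. The `Router` class, its method names, and the backend addresses are hypothetical stand-ins for a real load balancer's target configuration, not any product's API.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy stand-in for a load balancer that sends all traffic to one pool."""

    pools: dict[str, list[str]]  # color -> backend addresses (hypothetical)
    active: str = "blue"
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def backends(self) -> list[str]:
        """Backends that new requests are routed to right now."""
        with self._lock:
            return self.pools[self.active]

    def switch_to(self, color: str) -> str:
        """Point all new requests at `color`; return the previously active color."""
        if color not in self.pools:
            raise ValueError(f"unknown pool: {color}")
        with self._lock:
            previous, self.active = self.active, color
        return previous


router = Router(pools={"blue": ["10.0.1.10:8080"], "green": ["10.0.2.10:8080"]})
previous = router.switch_to("green")  # cut over
router.switch_to(previous)            # rollback is the same operation in reverse
```

Because the active pool is one value changed under a lock, no request in this model ever sees a mix of blue and green backends.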
Health checks: liveness vs readiness probes
Before the load balancer switch, your orchestrator must verify the green environment is genuinely ready for traffic: not just that the process started, but that it can handle requests correctly.
Two probe types serve different purposes:
- Liveness probe: answers "is this process alive?" If it fails, the orchestrator kills and restarts the container. Use it for detecting deadlocks or fatal crashes.
- Readiness probe: answers "is this instance ready to serve requests?" If it fails, the instance is removed from the load balancer rotation but is not restarted.
The distinction matters: a slow database connection pool warmup should fail readiness (keep traffic away) but not liveness (don't restart the container). Triggering a restart while the pool is still warming makes the problem worse.
The readiness probe is your deployment gate: the load balancer switch only fires after all green pods have passed their readiness checks. Without this gate, the switch can route traffic to pods that haven't finished initializing their caches or connection pools.
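The readiness gate described above could be sketched as a polling loop. This is an illustration under assumptions: `is_ready` is a hypothetical callable that wraps whatever readiness check your platform exposes, and the timeout and poll interval are placeholder values.

```python
import time
from typing import Callable, Iterable


def wait_until_ready(
    instances: Iterable[str],
    is_ready: Callable[[str], bool],
    timeout_s: float = 120.0,
    poll_interval_s: float = 2.0,
) -> bool:
    """Return True once every instance reports ready, False if the timeout expires."""
    pending = set(instances)
    deadline = time.monotonic() + timeout_s
    while pending:
        # Keep only the instances that still fail their readiness check.
        pending = {name for name in pending if not is_ready(name)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
    return True
```

The traffic switch should run only when this returns `True`. A `False` result leaves blue serving and green untouched for investigation.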
Session draining
When you flip the switch, some requests are mid-flight on the blue environment. You need to drain those gracefully rather than terminating them.
The standard approach: set the blue instances to "draining" mode. The load balancer stops sending new connections but allows existing connections to complete (typically with a 30-60 second timeout). Only after draining completes does blue go fully idle. Most load balancers (ALB, Nginx, HAProxy) support connection draining natively with configurable timeouts.
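Draining behaviour can be modelled in a few lines. The sketch below is a simplified in-process illustration, not how ALB, Nginx, or HAProxy implement it. The `DrainingPool` class and its method names are hypothetical.

```python
import threading


class DrainingPool:
    """Tracks in-flight requests so a pool can be drained before going idle."""

    def __init__(self) -> None:
        self._in_flight = 0
        self._draining = False
        self._cond = threading.Condition()

    def try_begin_request(self) -> bool:
        """Accept a request unless the pool is draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        """Mark one in-flight request as finished and wake any waiting drain."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s: float = 60.0) -> bool:
        """Stop accepting new work; return True if in-flight work finished in time."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)
```

A `False` return from `drain` means the timeout expired with requests still running, which is the case where a real load balancer would cut the remaining connections.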
DNS-based switching and session draining
DNS switches are harder to drain because you can't force clients to re-resolve DNS. Some clients will keep hitting blue for minutes after the DNS change. If you must use DNS, keep blue running and healthy for at least 2x your TTL value.
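The 2x TTL rule is simple arithmetic, shown here as a small helper. The function name and the `safety_factor` parameter are illustrative assumptions. Since some clients ignore TTLs, treat the result as a lower bound, not a guarantee.

```python
from datetime import datetime, timedelta, timezone


def earliest_blue_teardown(
    switched_at: datetime, ttl_seconds: int, safety_factor: float = 2.0
) -> datetime:
    """Earliest time blue should be considered for shutdown after a DNS switch."""
    return switched_at + timedelta(seconds=ttl_seconds * safety_factor)


switched_at = datetime(2024, 1, 4, 14, 0, tzinfo=timezone.utc)
print(earliest_blue_teardown(switched_at, ttl_seconds=300))
# 2024-01-04 14:10:00+00:00
```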
Warm up the green environment
A cold environment handles its first burst of traffic poorly. Connection pools are empty, JIT compilers haven't warmed, caches are cold. Before switching, send synthetic warm-up traffic to green: a few thousand representative requests that prime the caches, connection pools, and class loaders.
I've seen production incidents where green was "healthy" by smoke test standards but fell over under real load because nobody warmed the connection pool to the database.
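A warm-up pass might look like the sketch below. It assumes a hypothetical `send` callable that issues one request against green and returns `True` on success. The request count and failure budget are placeholders to tune for your service.

```python
import itertools
from typing import Callable, Sequence


def warm_up(
    send: Callable[[str], bool],
    paths: Sequence[str],
    total_requests: int = 2000,
    max_failure_ratio: float = 0.01,
) -> bool:
    """Replay representative paths round-robin; True if failures stay within budget."""
    if not paths:
        raise ValueError("need at least one path")
    failures = 0
    # Cycle through the paths until total_requests have been sent.
    for path in itertools.islice(itertools.cycle(paths), total_requests):
        if not send(path):
            failures += 1
    return failures <= total_requests * max_failure_ratio
```

Choose paths that exercise the database and cache layers, since those are the components that start cold.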
Rollback flow
Rollback is where blue-green earns its keep. The entire process is a single routing change, not a redeployment.
An important detail: don't destroy the failed green environment after rollback. Keep it running (without traffic) so you can SSH in, inspect logs, and reproduce the bug in a production-identical environment. This is a debugging luxury that canary deployments don't offer as cleanly.
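The rollback rules above reduce to two state changes and one deliberate omission. This sketch uses a hypothetical `Environment` record to make that explicit. A real rollback would change load balancer configuration, not Python attributes.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    color: str
    serving: bool  # currently receives user traffic
    running: bool  # instances are still up and inspectable


def roll_back(blue: Environment, green: Environment) -> None:
    """Return traffic to blue; keep green up, without traffic, for debugging."""
    if not blue.running:
        raise RuntimeError("blue was torn down; rollback would need a redeploy")
    blue.serving = True
    green.serving = False
    # No teardown here: green.running is left untouched on purpose.


blue = Environment("blue", serving=False, running=True)
green = Environment("green", serving=True, running=True)
roll_back(blue, green)
print(blue, green)
```

The guard on `blue.running` captures the precondition for instant rollback: it only works while the old environment is still up.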
Kubernetes implementation
On Kubernetes, blue-green maps naturally to two Deployments with a single Service. The Service selector points at the active color label.
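As a rough sketch of the selector switch, the helper below builds the patch body and the matching `kubectl patch` command. The `app` and `color` label keys and the `checkout` service name are assumptions for illustration. Use whatever labels your two Deployments actually carry.

```python
import json
import shlex


def selector_patch(app: str, color: str) -> str:
    """Build a patch body that repoints a Service at one color's pods."""
    if color not in ("blue", "green"):
        raise ValueError(f"unexpected color: {color}")
    return json.dumps({"spec": {"selector": {"app": app, "color": color}}})


def kubectl_command(service: str, app: str, color: str) -> str:
    """Render the shell command that would apply the patch to `service`."""
    patch = selector_patch(app, color)
    return f"kubectl patch service {shlex.quote(service)} -p {shlex.quote(patch)}"


print(kubectl_command("checkout", "checkout", "green"))
```

Rolling back is the same command with the color set to `blue`, provided the blue Deployment is still running.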