Blue-green deployment
How blue-green deployment eliminates downtime by maintaining two identical production environments and swapping traffic instantly, with rollback guarantees and database migration strategies.
TL;DR
- Blue-green keeps two identical production environments: blue (live) and green (idle). You deploy to the idle one, test it, then flip the load balancer.
- The traffic switch is atomic, not progressive. All users move from v1 to v2 in a single step, providing zero-downtime deployments with no mixed-version window.
- Rollback is instant: flip the load balancer back to blue. No redeployment, no waiting, no downtime.
- The hardest part is database migrations. Both environments share one database, so schema changes must be backward-compatible (expand-and-contract pattern).
- Blue-green costs roughly 2x infrastructure during the deployment window, which is why many teams move to canary deployments once they have the observability to support progressive rollouts.
- Prefer load balancer switching over DNS-based switching. DNS TTL caching makes rollbacks unpredictable.
The Problem
Your team deploys every Thursday at 2 PM. The process: pull the service from the load balancer, deploy new code, restart, run smoke tests, add it back. During the restart window (sometimes 30 seconds, sometimes 5 minutes), the service is down. Customers see 502 errors. The support team braces for the weekly deploy.
Sometimes the deploy fails entirely. The new version crashes on startup, or a configuration is wrong, or a missing environment variable takes down the service. Now you're scrambling to redeploy the old version while customers wait.
Rolling deployments improve the situation but introduce a different problem. For 10 to 15 minutes, some instances run v1 and others run v2. If v2 changes an API response format, v1 clients may break. If v2 modifies session structures, users who bounce between instances get corrupted sessions.
What you really want is a deployment where the switch is instant, all-or-nothing, and reversible. No mixed-version window. No downtime window. You deploy the new version somewhere safe, verify it works, then flip a switch. If it breaks, flip the switch back.
The insight behind blue-green: you don't deploy to the running environment. You deploy to a completely separate, identical environment that nobody is using. You test it in isolation. Then you move users there all at once.
One-Line Definition
Blue-green deployment runs two identical environments (blue = live, green = idle) and deploys new code to the idle one, then atomically switches all traffic from blue to green via a load balancer or DNS change.
Analogy
Think of a theater with two identical stages, stage left and stage right. The audience (users) only faces one stage at a time. While the live show runs on stage left, the crew sets up the next act on stage right: building sets, testing lights, running a full dress rehearsal. When everything is ready, the turntable rotates and the audience instantly sees the new act. If a prop breaks or an actor forgets their line, you rotate back in seconds. No intermission, no awkward scene change. The crew on stage left is still there, ready to perform again.
Solution Walkthrough
The deployment lifecycle
Blue-green has five distinct phases: (1) deploy the new version to the idle environment, (2) test it with synthetic or internal traffic, (3) switch the load balancer, (4) drain in-flight connections from the old environment, and (5) monitor through a confidence window before declaring success. The key insight is that the new version receives zero user traffic until you're confident it works. Every other deployment strategy (rolling, canary) exposes some users to the new version before you know it works at full scale.
Phase 2 is what separates blue-green from canary. You're testing with synthetic or internal traffic, not real users. This gives you a pre-production validation step in a production-identical environment.
Traffic switching mechanisms
The switch can happen at different layers, each with different speed and flexibility:
| Mechanism | Switch Speed | Rollback Speed | Complexity |
|---|---|---|---|
| Load balancer (ALB/Nginx) | Instant (seconds) | Instant | Low |
| DNS (Route 53 weighted) | Minutes (TTL-dependent) | Minutes | Medium |
| Kubernetes Service selector | Seconds | Seconds | Low (if already on K8s) |
| API Gateway routing | Instant | Instant | Medium |
My recommendation: use the load balancer approach unless you have a specific reason for DNS. DNS TTL caching makes rollback slower and less predictable. Some clients cache DNS for much longer than the TTL specifies.
Health checks: liveness vs readiness probes
Before the load balancer switch, your orchestrator must verify the green environment is genuinely ready for traffic: not just that the process started, but that it can handle requests correctly.
Two probe types serve different purposes:
- Liveness probe: answers "is this process alive?" If it fails, the orchestrator kills and restarts the container. Use it for detecting deadlocks or fatal crashes.
- Readiness probe: answers "is this instance ready to serve requests?" If it fails, the instance is removed from the load balancer rotation but is not restarted.
The distinction matters: a slow database connection pool warmup should fail readiness (keep traffic away) but not liveness (don't restart the container). Triggering a restart while the pool is still warming makes the problem worse.
The readiness probe is your deployment gate: the load balancer switch only fires after all green pods have passed their readiness checks. Without this gate, the switch can route traffic to pods that haven't finished initializing their caches or connection pools.
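In Kubernetes, the two probes map directly onto container spec fields. A minimal sketch; the endpoint paths (`/healthz`, `/ready`) and timing values are assumptions, not prescriptions:

```yaml
# Probe configuration for a green pod (illustrative values)
livenessProbe:
  httpGet:
    path: /healthz        # only confirms the process is alive
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3     # restart only after sustained failure
readinessProbe:
  httpGet:
    path: /ready          # confirms pools and caches are warmed
    port: 8080
  periodSeconds: 5
  failureThreshold: 2     # pull from rotation quickly, never restart
```

Note the asymmetry: the readiness probe fails fast (to keep traffic away), while the liveness probe is deliberately patient (to avoid restart loops during slow warmup).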
Session draining
When you flip the switch, some requests are mid-flight on the blue environment. You need to drain those gracefully rather than terminating them.
The standard approach: set the blue instances to "draining" mode. The load balancer stops sending new connections but allows existing connections to complete (typically with a 30-60 second timeout). Only after draining completes does blue go fully idle. Most load balancers (ALB, Nginx, HAProxy) support connection draining natively with configurable timeouts.
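On Kubernetes, the same draining behavior is commonly approximated with a `preStop` hook plus a generous termination grace period: the sleep gives the load balancer time to stop routing to the pod before SIGTERM arrives. A sketch (the specific durations are assumptions):

```yaml
# Pod spec fragment: delay SIGTERM until endpoint removal propagates,
# then leave headroom for in-flight requests to complete
terminationGracePeriodSeconds: 75
containers:
  - name: myapp
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]  # wait out load balancer propagation
```

The grace period should exceed the preStop sleep plus your longest expected request duration; otherwise the kubelet kills the container mid-request.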
DNS-based switching and session draining
DNS switches are harder to drain because you can't force clients to re-resolve DNS. Some clients will keep hitting blue for minutes after the DNS change. If you must use DNS, keep blue running and healthy for at least 2x your TTL value.
Warm up the green environment
A cold environment handles its first burst of traffic poorly. Connection pools are empty, JIT compilers haven't warmed, caches are cold. Before switching, send synthetic warm-up traffic to green: a few thousand representative requests that prime the caches, connection pools, and class loaders.
I've seen production incidents where green was "healthy" by smoke test standards but fell over under real load because nobody warmed the connection pool to the database.
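A warm-up step can be as simple as replaying a weighted mix of representative requests against green before the switch. A sketch in TypeScript; the endpoint paths and weights are illustrative assumptions, and the part worth getting right is that the warm-up mix mirrors production traffic:

```typescript
interface WarmupEndpoint {
  path: string;
  weight: number; // relative share of warm-up traffic
}

interface WarmupRequest {
  path: string;
  requests: number;
}

// Distribute a total request budget across endpoints proportionally to
// their weights, so warm-up traffic resembles real production traffic.
function buildWarmupPlan(endpoints: WarmupEndpoint[], total: number): WarmupRequest[] {
  const totalWeight = endpoints.reduce((sum, e) => sum + e.weight, 0);
  let allocated = 0;
  return endpoints.map((e, i) => {
    // Give the last endpoint the remainder so counts sum exactly to `total`
    const requests = i === endpoints.length - 1
      ? total - allocated
      : Math.floor((e.weight / totalWeight) * total);
    allocated += requests;
    return { path: e.path, requests };
  });
}

// Example: prime the hot paths before flipping the load balancer
const plan = buildWarmupPlan(
  [
    { path: "/api/products", weight: 6 },
    { path: "/api/cart", weight: 3 },
    { path: "/api/checkout", weight: 1 },
  ],
  5000,
);
```

Replaying the plan against green (with any HTTP client) primes connection pools, caches, and JIT-compiled hot paths before the first real user arrives.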
Rollback flow
Rollback is where blue-green earns its keep. The entire process is a single routing change, not a redeployment.
An important detail: don't destroy the failed green environment after rollback. Keep it running (without traffic) so you can SSH in, inspect logs, and reproduce the bug in a production-identical environment. This is a debugging luxury that canary deployments don't offer as cleanly.
Kubernetes implementation
On Kubernetes, blue-green maps naturally to two Deployments with a single Service. The Service selector points at the active color label.
```yaml
# Service (the traffic switch)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    color: green   # ← change to "blue" for rollback
  ports:
    - port: 80
      targetPort: 8080
---
# Blue Deployment (currently idle)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      color: blue
  template:
    metadata:
      labels:
        app: myapp
        color: blue
    spec:
      containers:
        - name: myapp
          image: myapp:1.0   # previous version (image name illustrative)
---
# Green Deployment (currently active)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      color: green
  template:
    metadata:
      labels:
        app: myapp
        color: green
    spec:
      containers:
        - name: myapp
          image: myapp:2.0   # new version (image name illustrative)
```
The switch is `kubectl patch service myapp -p '{"spec":{"selector":{"color":"blue"}}}'`. One command, instant effect. Argo Rollouts and Flagger can automate this with health checks and automatic rollback.
Rolling update parameters (Kubernetes)
When blue-green isn't warranted and you fall back to a rolling update, four Kubernetes parameters control the rollout safety profile:
| Parameter | Description | Safe default |
|---|---|---|
| `maxUnavailable` | Max pods that can be down simultaneously | 25% (0 stalls the rollout unless `maxSurge` > 0) |
| `maxSurge` | Max extra pods above desired count | 25% |
| `minReadySeconds` | Seconds a pod must be ready before counting as healthy | 30 |
| `progressDeadlineSeconds` | Timeout before the rollout is declared failed | 600 |
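These parameters map onto the Deployment spec like so; a sketch using the safe defaults from the table:

```yaml
# Rolling update safety profile (illustrative Deployment fragment)
spec:
  minReadySeconds: 30              # pod must stay ready this long to count
  progressDeadlineSeconds: 600     # fail the rollout after 10 minutes stuck
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%          # tolerate losing a quarter of capacity
      maxSurge: 25%                # allow a quarter extra pods during rollout
```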
The mixed-version hazard. During a rolling update, requests for the same user session can land on v1 or v2 depending on which pod the load balancer hits. If your API response format changed between versions, you will create client-visible inconsistencies. Always keep APIs backward-compatible for the duration of any rolling deployment. This is the key reason to prefer blue-green when API changes are involved.
AWS implementation
On AWS, the most common setup uses ECS with CodeDeploy or ALB target groups:
- Two ECS services (blue and green) registered to two ALB target groups.
- CodeDeploy manages the traffic shift. You specify `AllAtOnce` for true blue-green or `TimeBasedLinear` for a brief canary period before the full switch.
- Route 53 weighted routing is the DNS-based alternative, with weights set to 100/0 for blue-green. Switch by changing weights to 0/100. Note the TTL caveat above.
For ECS blue-green, CodeDeploy handles the entire lifecycle: deploy to the replacement task set, wait for the test listener to report healthy, shift production traffic, and terminate the original task set after the confidence window.
Cost optimization: auto-scale the idle environment
You don't need to keep both environments at full capacity permanently. Scale the idle environment to zero or minimal instances between deploys. When a deployment starts, scale up idle to match production capacity, deploy, test, switch, then scale down the old environment. The 2x cost window shrinks from "always on" to "30 minutes during each deploy." On ECS, this is a task count change. On Kubernetes, scale the idle Deployment replicas.
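The savings are easy to quantify. A back-of-the-envelope sketch, with hypothetical hourly costs and deploy cadence as the assumptions:

```typescript
// Extra compute cost of the idle environment under two policies:
// kept "always on" vs scaled to zero outside deploy windows.
function idleEnvCostPerMonth(
  hourlyCost: number,      // cost of one full environment per hour
  deploysPerMonth: number,
  windowHours: number,     // how long the idle env runs per deploy
): { alwaysOn: number; scaledDown: number } {
  const hoursPerMonth = 730; // average hours in a month
  return {
    alwaysOn: hourlyCost * hoursPerMonth,
    scaledDown: hourlyCost * deploysPerMonth * windowHours,
  };
}

// e.g. a $50/hour environment, 4 deploys a month, 30-minute windows
const cost = idleEnvCostPerMonth(50, 4, 0.5);
// alwaysOn: $36,500/month vs scaledDown: $100/month
```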
Implementation Sketch
```typescript
// Blue-green deployment controller (simplified)
interface Environment {
  name: "blue" | "green";
  version: string;
  instances: string[];
  status: "live" | "idle" | "draining";
}

async function deploy(newVersion: string): Promise<void> {
  const idle = getIdleEnvironment(); // green
  const live = getLiveEnvironment(); // blue

  // Phase 1: Deploy
  await deployToEnvironment(idle, newVersion);

  // Phase 2: Test
  await runSmokeTests(idle.instances);
  await runWarmupTraffic(idle.instances, { requests: 5000 });

  // Phase 3: Switch
  await updateLoadBalancer({ target: idle.name });
  idle.status = "live";
  live.status = "draining";

  // Phase 4: Drain + Monitor
  await waitForDrain(live, { timeout: 60_000 });
  live.status = "idle";

  // Phase 5: Confidence window
  const healthy = await monitorMetrics({ duration: 900_000 }); // 15 min
  if (!healthy) {
    await updateLoadBalancer({ target: live.name }); // rollback
    throw new Error("Rollback: metrics degraded after switch");
  }
}
```
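The metric gate in Phase 5 is where the rollback decision actually happens. A minimal sketch of such a gate, comparing the post-switch environment against the pre-switch baseline; the threshold values are assumptions you would tune per service:

```typescript
interface MetricsSnapshot {
  errorRate: number;    // fraction of 5xx responses, 0..1
  p99LatencyMs: number;
}

// Decide whether the newly live environment is healthy enough to keep.
// Fails on an absolute error-rate ceiling, or on a large latency
// regression relative to the pre-switch baseline.
function isHealthy(
  baseline: MetricsSnapshot,
  current: MetricsSnapshot,
  opts = { maxErrorRate: 0.01, maxLatencyRegression: 1.5 },
): boolean {
  if (current.errorRate > opts.maxErrorRate) return false;
  if (current.p99LatencyMs > baseline.p99LatencyMs * opts.maxLatencyRegression) {
    return false;
  }
  return true;
}
```

Relative latency thresholds matter here: green is a fresh environment, so comparing it to blue's baseline catches regressions that an absolute threshold would miss.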
Database Migration: The Hard Part
Both environments share one database. When green requires a schema change, blue (your rollback target) must still work with the migrated schema. A migration that drops a column makes rollback impossible.
The solution is the expand-and-contract pattern (also called parallel change). Every migration goes through three phases spread across multiple deploy cycles: expand (add the new column or table alongside the old), migrate (dual-write and backfill so both shapes stay consistent), and contract (remove the old column once no deployed version reads it).
The critical rule: never contract (drop column, add NOT NULL, rename column) until the old version no longer exists as a rollback target. This typically means the contract step happens in the next deploy cycle, not the current one.
In practice, most teams maintain a "migration safety checklist" that categorizes every migration as safe (add nullable column, add index) or requires-expand-and-contract (rename, drop, type change, NOT NULL constraint).
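Such a checklist can even be enforced in CI. A toy sketch that classifies migration operations by rollback safety; the operation names are hypothetical labels mirroring the checklist above:

```typescript
type MigrationSafety = "safe" | "requires-expand-and-contract";

// Operations old application versions tolerate without code changes
const SAFE_OPS = new Set(["add-nullable-column", "add-index", "add-table"]);

// Operations that break the rollback target if applied directly
const BREAKING_OPS = new Set([
  "rename-column",
  "drop-column",
  "change-column-type",
  "add-not-null-constraint",
]);

function classifyMigration(op: string): MigrationSafety {
  if (SAFE_OPS.has(op)) return "safe";
  // Unknown operations get the conservative treatment too
  return "requires-expand-and-contract";
}
```

A CI step can then block any deploy whose migration set contains a `requires-expand-and-contract` operation that isn't split across deploy cycles.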
Interview tip: name the pattern
If a system design question involves deployments and databases, mention "expand-and-contract migrations" by name. It signals you understand database schema changes can't be atomic with application deployments.
When It Shines
- Zero-downtime mandatory: banking, healthcare, e-commerce checkout. Any system where even 30 seconds of downtime costs real money.
- Instant rollback required: if your deployment failure recovery time must be under 60 seconds, blue-green gives you a single load balancer flip.
- Compliance environments: blue-green gives auditors a clear "before" and "after" environment they can inspect. Both are running, both can be tested.
- Pre-production validation: when you need to run a full test suite against production-identical infrastructure before any user sees the new code.
- Infrequent, high-confidence deploys: teams that deploy weekly (not hourly) and want maximum safety per deploy.
- Debugging failed deployments: the failed environment stays intact for investigation, unlike rolling updates where the broken version is replaced immediately.
Failure Modes & Pitfalls
The shared database trap. Both environments hit the same database. A migration that's not backward-compatible breaks rollback. I see this fail in production more than any other blue-green issue. Always plan migrations as expand-and-contract.
DNS TTL lag. If you use DNS-based switching (Route 53 weighted routing), cached DNS entries can send traffic to blue for minutes after you switch. ALB target group switching avoids this entirely.
Cold green environment. Green has empty caches, cold connection pools, and uncompiled JIT code. Without a warm-up step, the first few seconds of real traffic hit a slow environment, spiking latency and potentially triggering alerts.
Cost pressure. Running two full environments permanently is expensive. Teams sometimes cut corners by making the idle environment smaller, which means it can't handle full production load and defeats the purpose. Budget for 2x compute during the deployment window, or consider canary deployment instead.
Long-running jobs. If blue is running batch jobs or processing background work, flipping traffic doesn't stop those jobs. You need a drain mechanism for background workers, not just HTTP connections.
Feature flag interaction. If your application uses feature flags that read from a central config service, both blue and green share the same flag state. Toggling a flag affects both environments simultaneously. If you're using flags to gate features during a blue-green deploy, plan the timing carefully: toggle the flag only after the switch, not before.
Health check false positives. A common smoke test only checks GET /health returns 200. This verifies the application boots but not that it functions correctly. Add deep health checks that verify database connectivity, cache access, and critical business logic before declaring green ready for traffic.
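A deep health check aggregates dependency checks instead of merely reporting that the process booted. A sketch with the individual checks stubbed; in a real service they would hit the database, cache, and critical business paths:

```typescript
type HealthCheck = () => Promise<boolean>;

// Run every dependency check and report overall readiness plus
// per-check detail for debugging a failed deploy.
async function deepHealthCheck(
  checks: Record<string, HealthCheck>,
): Promise<{ ready: boolean; detail: Record<string, boolean> }> {
  const detail: Record<string, boolean> = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      detail[name] = await check();
    } catch {
      detail[name] = false; // a throwing check counts as failed
    }
  }
  return { ready: Object.values(detail).every(Boolean), detail };
}

// Usage sketch: stubs standing in for real dependency probes
deepHealthCheck({
  database: async () => true, // e.g. run SELECT 1
  cache: async () => true,    // e.g. Redis PING
  pricing: async () => true,  // e.g. a critical business-logic probe
}).then(({ ready }) => console.log(ready ? "green is ready" : "hold the switch"));
```

Wiring this behind the readiness endpoint turns the "green looks healthy" claim into something the load balancer switch can actually trust.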
Trade-offs
Blue-green vs canary vs rolling update
| Dimension | Blue-Green | Canary | Rolling Update |
|---|---|---|---|
| Traffic switch | All-at-once | Progressive (1% → 100%) | Pod-by-pod replacement |
| Blast radius of bad deploy | 100% instantly | Proportional to canary % | Proportional to rollout progress |
| Rollback speed | Instant (LB flip) | Instant (route back) | Slow (redeploy old image) |
| Real-user validation | No (tests only) | Yes (canary cohort) | Yes (during rollout) |
| Mixed-version window | None | Yes (canary + stable) | Yes (old + new pods) |
| Infra cost | 2x during deploy | Minimal overhead | No extra cost |
| Schema migration safety | High (test before users) | Same constraint | Same constraint |
For your interview: know the blast radius and rollback speed differences. These are the two dimensions interviewers test most often.
Pros and cons
| Advantage | Disadvantage |
|---|---|
| Zero downtime deployments | 2x infrastructure cost during window |
| Instant rollback (seconds) | Shared database makes schema changes hard |
| Full testing before any user traffic | No real-user validation before switch |
| Simple mental model (two environments) | DNS-based switching has TTL lag |
| Works with any application type | Long-running jobs need separate drain logic |
The fundamental tension: blue-green gives you maximum safety at the moment of switch (instant, atomic, reversible) but at the cost of infrastructure expense and database migration complexity. If you can afford 2x compute and plan your schema changes carefully, it is the safest deployment strategy available. If your deploys are frequent and you want real-user validation, canary deployment is a better fit.
Here's the bottom line: blue-green is the "spend money to buy safety" deployment strategy. Canary is the "spend observability to buy confidence" strategy.
Real-World Usage
Amazon uses blue-green deployment extensively across AWS services. Amazon ECS natively supports blue-green deployments with CodeDeploy, managing traffic shifting between two target groups. AWS documents it as the recommended zero-downtime deployment strategy for ECS services. Internally, Amazon's deployment pipeline (Apollo) uses a variant that deploys to one Availability Zone at a time, combining blue-green's atomicity with geographic isolation.
GitHub deploys using a blue-green-like approach with their chatops-driven deployment system (Hubot + Heaven). Engineers trigger deploys via Slack, code ships to the idle environment, automated tests run, and traffic flips. GitHub deploys dozens of times per day with zero downtime. Their system also supports "branch deploys," where any feature branch can be deployed to the idle environment for pre-merge validation.
Netflix uses a variant called "red/black deployment" (same concept, different color names). They deploy to server groups in AWS Auto Scaling Groups, switch the ELB to the new group, and keep the old group warm for rollback. Their open-source platform Spinnaker automates the entire lifecycle, including automated canary analysis before the final switch. Netflix processes over 200 million streaming sessions daily, making zero-downtime deployment essential.
The pattern is mature enough that all three major cloud providers (AWS CodeDeploy, GCP Cloud Deploy, Azure DevOps) have first-party support for blue-green deployments as a standard pipeline step.
How This Shows Up in Interviews
Blue-green deployment comes up in two contexts during system design interviews.
First, when the interviewer asks "how would you deploy this system with zero downtime?" This is your cue to describe blue-green. Name the two environments, explain the load balancer switch, and mention expand-and-contract for database migrations. That level of specificity scores points.
Second, when discussing trade-offs between deployment strategies. The interviewer might ask "why not canary?" or "what are the downsides?" Be ready with the 2x cost, the database migration complexity, and the lack of real-user validation before the switch.
Here's a sketch you can draw in an interview to explain the concept in 30 seconds:
```
Before switch:                After switch:

  ┌────────┐                    ┌────────┐
  │ Users  │                    │ Users  │
  └───┬────┘                    └───┬────┘
      │                             │
  ┌───▼────┐                    ┌───▼────┐
  │   LB   │                    │   LB   │
  └───┬────┘                    └───┬────┘
      │                             │
      ▼                             ▼
 ┌────────┐  ┌────────┐    ┌────────┐  ┌────────┐
 │ Blue   │  │ Green  │    │ Blue   │  │ Green  │
 │ v1     │  │ v2     │    │ v1     │  │ v2     │
 │ (live) │  │ (idle) │    │ (idle) │  │ (live) │
 └────────┘  └────────┘    └────────┘  └────────┘
```
For your interview: say "two identical environments, atomic load balancer switch, expand-and-contract for schema changes" and you've covered 90% of what the interviewer wants to hear. If asked about trade-offs, lead with "2x compute cost during the deploy window" and "shared database migrations are the hardest part."
The three-word answer to "how do you do zero-downtime deploys?" is "blue-green deployment." The three-word follow-up to "what's the hardest part?" is "expand-and-contract migrations."
Interview shortcut: blue-green vs canary one-liner
"Blue-green tests in a production-identical environment before any user sees it. Canary tests with real users but limits the blast radius. Blue-green for safety before switch, canary for validation during rollout."
Quick Recap
- Blue-green deployment maintains two identical production environments and deploys new code to the idle one before switching all traffic simultaneously.
- The traffic switch is atomic (usually a load balancer target group change) and rollback is instant (flip back to the old environment).
- The shared database is the primary complexity: schema migrations must use the expand-and-contract pattern to remain backward-compatible with the rollback environment.
- Warm-up traffic to the green environment is essential to prevent cold-start latency spikes after the switch.
- Session draining ensures in-flight requests on blue complete gracefully before the old environment goes idle.
- The 2x infrastructure cost during deployment is the main economic argument for alternatives like canary deployment. Auto-scaling the idle environment reduces this.
- Blue-green tests in a production-identical environment before users see it; canary tests with real users but limits blast radius. Choose based on whether you value pre-switch safety or real-user validation.
Related Patterns
- Canary deployment: progressive rollout that limits blast radius by sending a small percentage of traffic to the new version first. Use instead of blue-green when you want real-user validation during rollout.
- Feature flags: decouple feature release from code deployment. Combine with blue-green to deploy code via blue-green and control feature visibility via flags.
- Strangler fig: incremental migration strategy for replacing legacy systems. Blue-green provides the zero-downtime switch mechanism for each strangler step.
- Circuit breaker: if the green environment starts failing after switch, circuit breakers prevent cascading failures while you roll back to blue.