Canary deployment
How canary deployments reduce blast radius by routing a small percentage of production traffic to the new version, with progressive promotion, automated rollback triggers, and metrics-driven confidence.
TL;DR
- Canary deployment routes a small percentage of production traffic (1-5%) to the new version while the majority stays on the stable version.
- You observe real-user metrics (error rate, latency, business KPIs) and progressively promote the canary to more traffic if metrics are healthy.
- If metrics degrade, the canary is rolled back automatically, limiting the blast radius to only the canary cohort.
- The key advantage over blue-green: you validate with real user traffic instead of synthetic tests. The key cost: you need strong observability to make good promotion decisions.
- Automated canary analysis tools (Kayenta, Flagger, Argo Rollouts) compare canary metrics against the stable baseline using statistical tests or threshold checks, reducing the need for manual judgment.
The Problem
Your team ships a new recommendation engine. Internal testing looks great. Performance benchmarks pass. Code review is approved. You deploy using blue-green: test green, switch 100% of traffic. Within 10 minutes, customer support is flooded with reports. The new engine returns irrelevant results for users with sparse browsing history, a case your test data didn't cover. Every user experiences the degraded recommendations for 10 minutes before you roll back.
The problem isn't the deployment mechanism. Blue-green worked exactly as designed. The problem is that no amount of synthetic testing perfectly simulates real user behavior. Edge cases only surface under real traffic: unusual account states, mobile clients with intermittent connections, unexpected input combinations, traffic patterns that differ from your test suite.
What if, instead of switching 100% at once, you could send 1% of traffic to the new version, watch what happens for 15 minutes, and only promote to the next stage if metrics are healthy? That's the canary model. The 1% who experience a problem give you the signal to stop, and the other 99% never know there was a bad deploy.
The trade-off: you need observability good enough to detect problems in a 1% traffic slice. If you can't measure it, you can't canary it.
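To make that constraint concrete, here is a rough back-of-the-envelope sketch of how much data a small canary slice actually produces. The function name and the traffic figures (200 requests per second, 0.1% baseline error rate) are illustrative assumptions, not numbers from a real system.

```python
def canary_sample(total_rps: float, canary_pct: float,
                  window_min: float, error_rate: float) -> tuple:
    """Return (requests seen by the canary, expected errors) for one window.

    All inputs are assumptions you would replace with your own traffic data.
    """
    requests = round(total_rps * canary_pct / 100 * window_min * 60)
    return requests, requests * error_rate


# Hypothetical service: 200 rps total, 1% canary, 15-minute window,
# 0.1% baseline error rate.
requests, expected_errors = canary_sample(200, 1, 15, 0.001)
print(requests, expected_errors)
```

With these assumed numbers the canary sees 1,800 requests and fewer than two expected errors per window, so a doubling of the error rate is hard to tell apart from noise. Low-traffic services typically need a longer window or a larger first stage.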
One-Line Definition
Canary deployment progressively shifts real user traffic from the stable version to the new version in stages (1% to 5% to 25% to 100%), using automated metrics comparison to decide whether to promote or roll back at each stage.
Analogy
A food company wants to change a recipe. Instead of shipping the new recipe to every store at once, they stock it in 5 stores in one city. They watch sales data and customer complaints for a week. If the numbers look good, they expand to 50 stores. Then 500. Then nationwide. If complaints spike at 5 stores, they pull it back before 99% of customers ever taste the new recipe. The 5-store test is the canary. The key: the decision to expand is based on measured outcomes, not hope.
Solution Walkthrough
Progressive promotion flow
The core of canary deployment is a staged traffic shift with metric gates between each stage. Each stage is a decision point: promote, hold, or roll back.
The wait times between stages are intentional. Some bugs take time to manifest: memory leaks that build over minutes, cache TTL-related issues that surface after the cache expires, or business logic bugs that only trigger at certain times of day.
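The staged flow can be sketched as a small control loop. This is a minimal illustration, not any tool's actual implementation: `STAGES`, `set_traffic`, and `is_healthy` are hypothetical names, and the stage durations are illustrative values. A real controller would call your load balancer or service mesh in `set_traffic` and your metrics backend in `is_healthy`.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical stage plan: (canary traffic percent, observation seconds).
STAGES = [(1, 600), (5, 1200), (25, 2700), (50, 2700)]


def run_canary(
    stages: Sequence[Tuple[int, int]],
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Walk through the stages; return True if the canary was fully promoted."""
    for pct, wait_s in stages:
        set_traffic(pct)         # shift this share of traffic to the canary
        sleep(wait_s)            # intentional soak time before judging the stage
        if not is_healthy(pct):  # metric gate between stages
            set_traffic(0)       # roll back: all traffic returns to stable
            return False
    set_traffic(100)             # every gate passed: promote
    return True
```

Passing `sleep` as a parameter keeps the loop testable: a test can inject a no-op and assert on the sequence of traffic shifts without waiting.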
What to watch at each stage
Not all metrics matter at every stage. Early stages focus on catastrophic failure detection. Later stages focus on subtle regression detection.
| Stage | Traffic | Duration | Technical Gate | Business Gate |
|---|---|---|---|---|
| 1% canary | 1% | 5-15 min | Error rate below 1%, no crash loops | None (too little data) |
| 5% canary | 5% | 15-30 min | Error rate below 0.5%, p99 under 1.5x baseline | None |
| 25% canary | 25% | 30-60 min | Error rate below 0.3%, p99 under 1.3x baseline | Conversion rate within 2% of stable |
| 50% canary | 50% | 30-60 min | Same as 25% gate | Revenue per session within 2% of stable |
| 100% promotion | 100% | N/A | N/A | N/A |
The most important rule: always compare canary metrics to the stable cohort running at the same time, not to historical baselines. Traffic patterns vary by hour and day. A 5% error rate at 3 AM might be normal (batch jobs), while 0.5% at 2 PM is a disaster.
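As a sketch of how one row of the table might be evaluated, the snippet below checks a canary window against a stable window collected over the same time period. `WindowStats` and `passes_gate` are hypothetical names, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for one cohort over one observation window (hypothetical shape)."""
    requests: int
    errors: int
    p99_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def passes_gate(canary: WindowStats, stable: WindowStats,
                max_error_rate: float, max_p99_ratio: float) -> bool:
    """Technical gate: absolute error ceiling plus p99 relative to stable."""
    error_ok = canary.error_rate < max_error_rate
    # The latency baseline is the stable cohort in the SAME window,
    # not a historical average.
    latency_ok = canary.p99_ms < max_p99_ratio * stable.p99_ms
    return error_ok and latency_ok


# 5% stage from the table: error rate below 0.5%, p99 under 1.5x baseline.
stable = WindowStats(requests=190_000, errors=380, p99_ms=300.0)
canary = WindowStats(requests=10_000, errors=30, p99_ms=420.0)
print(passes_gate(canary, stable, max_error_rate=0.005, max_p99_ratio=1.5))
```

Here the canary's error rate is 0.3% and its p99 is 1.4x the concurrent stable p99, so the gate passes.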
The metrics monitoring feedback loop
Canary analysis runs continuously during each stage, comparing the canary cohort against the stable cohort in real time.
Statistical comparison (not just threshold comparison) is critical. A canary with a 0.3% error rate might look fine in isolation, but if the stable cohort has 0.1%, that's a 3x increase. Canary analysis tools like Netflix's Kayenta use Mann-Whitney U tests to determine if the difference is statistically significant or just noise.
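To show what such a test does, here is a simplified pure-Python Mann-Whitney U sketch. It uses the normal approximation without tie or continuity corrections, so treat it as an illustration of the idea rather than a reproduction of Kayenta's implementation. The function name is an assumption. In practice a maintained library routine such as `scipy.stats.mannwhitneyu` is a better choice.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(canary, stable):
    """Return (U statistic for the canary sample, two-sided p-value).

    Simplified: normal approximation, no tie or continuity correction.
    """
    n1, n2 = len(canary), len(stable)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, remembering which cohort each value came from.
    pooled = sorted([(v, 0) for v in canary] + [(v, 1) for v in stable])

    # Sum the canary's ranks, giving tied values their average rank.
    canary_rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        canary_rank_sum += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = canary_rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value
```

The inputs would typically be per-interval metric samples (for example, one latency or error-rate reading per minute) from each cohort over the same window. A small p-value suggests the two distributions genuinely differ; it does not say which direction is worse, so pair it with a direction check.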
Automated rollback decision tree
The automated analysis at each stage ends in one of three decisions: PROMOTE, HOLD, or ROLLBACK. This is the kind of logic that tools like Kayenta and Flagger implement.
The "HOLD" decision is important. Some regressions are borderline. Rather than immediately rolling back, the system extends the observation window to collect more data. If the regression persists with more data, it moves to ROLLBACK. If it resolves, it moves to PROMOTE.
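One way to express the three outcomes is sketched below. The thresholds (`hard_limit`, `alpha`, `borderline`), the `max_holds` cap, and the rule that an exhausted hold budget becomes a rollback are all assumptions made for illustration; they are not taken from Kayenta or Flagger.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    HOLD = "hold"
    ROLLBACK = "rollback"


def decide(canary_rate: float, stable_rate: float, p_value: float,
           holds_used: int, *, hard_limit: float = 0.05, alpha: float = 0.05,
           borderline: float = 0.20, max_holds: int = 2) -> Decision:
    """Pick an outcome for one analysis run (all thresholds are hypothetical)."""
    # 1. Catastrophic failure: roll back regardless of statistics.
    if canary_rate >= hard_limit:
        return Decision.ROLLBACK

    worse = canary_rate > stable_rate

    # 2. Canary is worse and the difference is statistically significant.
    if worse and p_value < alpha:
        return Decision.ROLLBACK

    # 3. Borderline: extend the window, but only a limited number of times.
    if worse and p_value < borderline:
        return Decision.HOLD if holds_used < max_holds else Decision.ROLLBACK

    return Decision.PROMOTE
```

Capping the number of holds keeps a borderline canary from sitting at a partial traffic split indefinitely.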
Routing strategies
Random percentage (most common):
# Nginx weighted upstream
upstream backend {
    server stable-v1:8080 weight=95;
    server canary-v2:8080 weight=5;
}
User-cohort sticky canary:
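import hashlib

def route_request(user_id: str, canary_pct: int) -> str:
    # Same user always goes to the same version.
    # Python's built-in hash() is salted per process for strings, so use a
    # stable digest to keep routing deterministic across servers and restarts.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < canary_pct:
        return "canary"
    return "stable"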
Header-based (for internal testing):
def route_request(request) -> str:
    if request.headers.get("X-Canary") == "true":
        return "canary"  # Internal users can opt in
    # Everyone else falls through to percentage-based routing
    # (weighted_route is defined elsewhere).
    return weighted_route(request.user_id)
Sticky canary is better for detecting issues that require multiple requests to surface (session state corruption, account-level bugs). Random percentage is simpler and catches stateless bugs faster. Many teams use sticky canary as the default because an inconsistent user experience (sometimes seeing v1, sometimes v2) creates confusing bug reports.
Canary + feature flags
Canary and feature flags solve different dimensions of deployment risk. Canary controls which servers run the new code. Feature flags control which users see the new behavior. Combining them gives you defense in depth.
The common pattern: deploy new code behind a feature flag (flag = OFF) via canary. The canary validates that the new code doesn't break anything when the flag is off (no new behavior exposed). Then, once the code is on 100% of servers, enable the feature flag for 1% of users. This separates infrastructure risk (will the new code crash?) from product risk (will users like the new feature?).
Step 1: Canary deploy code (flag OFF) → 1% → 5% → 100%
    Validates: no crashes, no latency regression, no memory leaks
    Risk: code-level only
Step 2: Enable feature flag → 1% of users → 5% → 100%
    Validates: user behavior, business metrics, UX
    Risk: product-level only
This two-phase approach means a bad feature can be disabled instantly via flag without a redeployment. The code stays deployed, healthy, and serving the old behavior while the team investigates.
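The second phase, a percentage-based flag rollout, can be sketched with deterministic per-user bucketing. `bucket` and `flag_enabled` are hypothetical helpers, not the API of any particular feature-flag product. Salting the hash with the flag name is a design assumption that keeps each flag's 1% cohort independent of other flags and of the canary cohort.

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, 100), independent per salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def flag_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """True if this user is inside the flag's current rollout percentage."""
    return bucket(user_id, flag) < rollout_pct


# Raising rollout_pct from 1 to 5 to 100 only ever adds users: anyone
# enabled at a lower percentage stays enabled at a higher one.
# Setting rollout_pct to 0 disables the feature without a redeploy.
print(flag_enabled("new-recs", "user-42", 0))
```

Because the bucket depends only on the flag name and the user ID, the same user gets the same answer on every server and across restarts.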
Canary with stateful clients
If your application uses client-side state (local storage, cached tokens, service workers), a user who bounces between canary and stable may experience subtle bugs. Sticky routing by user ID prevents this but reduces the randomness of your canary sample.
Implementation Sketch