Canary rollout for agent policies
Roll out new agent system prompts, tool permissions, and model versions to a small traffic slice first, with automatic rollback if quality metrics drop.
TL;DR
- An agent policy is the full configuration bundle: system prompt + model version + tool permissions + temperature + guardrail config. Changing any of these can cause subtle quality regressions that unit tests and evals don't catch.
- Canary rollout deploys the new policy to 1-5% of production traffic and compares quality metrics (task completion rate, cost per task, error rate, latency P95) against the control group running the old policy.
- Automatic rollback triggers when any metric degrades beyond a threshold: completion rate drops >5%, cost increases >20%, or guardrail trigger rate increases >3x.
- Progressive promotion (1% -> 5% -> 25% -> 50% -> 100%) with metric gates at each stage catches regressions at every scale.
- Production teams using canary rollout for prompt changes report catching 60-80% of regressions before full deployment, compared to near-zero with immediate rollout.
- Limitation: statistical significance requires sample volume. At 1% traffic, low-traffic agents may need days to accumulate enough data for confident promotion or rollback decisions.
The Problem It Solves
Your customer support agent runs on Claude Sonnet with a carefully tuned system prompt. The prompt engineering team rewrites the system prompt to improve tone and reduce hallucinations. They test it on 50 evaluation examples and see a 12% improvement in quality scores. They deploy to production at 3 PM on a Tuesday.
By 5 PM, the support queue is backed up. The new prompt handles standard billing inquiries better, but it completely breaks the refund workflow. The model now refuses to process refund requests, interpreting them as "adversarial" because the new guardrail instructions are too aggressive. The 50-example eval set didn't include a single refund scenario.
I've seen this exact failure mode three times across different teams. The pattern is always the same: a prompt change that improves the majority of cases while catastrophically breaking a minority use case. The eval set, no matter how large, never covers the full distribution of production traffic. The only way to catch these regressions is to test on real production traffic at a safe scale.
What Is It?
Canary rollout deploys a new agent policy to a small percentage of production traffic (1-5%) while the remaining traffic continues using the current policy. The system compares real-time quality metrics between the canary (new policy) and control (old policy) groups. If the canary's metrics degrade beyond configurable thresholds, the system automatically rolls back to the old policy. If metrics hold steady or improve, the canary is progressively promoted to larger traffic percentages.
Think of it as a food taster for a medieval king. Before the king eats the full meal, one person takes a small bite. If the taster is fine after 30 minutes, the meal is safe. If the taster gets sick, the king doesn't eat. The 1% canary traffic is your taster. You sacrifice a small number of requests to potential quality issues in exchange for protecting the other 99%.
How It Works
Defining an agent policy
An agent "policy" is not just the system prompt. It's the complete configuration that determines agent behavior. Changing any single element can produce different outputs, and combinations of changes compound unpredictably.
| Policy Element | Example | Risk of Change |
|---|---|---|
| System prompt | Tone, instructions, guardrails | High: affects every response |
| Model version | Claude Sonnet 3.5 to 3.6 | High: different capabilities and quirks |
| Temperature | 0.7 to 0.3 | Medium: changes output diversity |
| Tool permissions | Add/remove available tools | High: enables/disables entire capabilities |
| Max tokens | 2048 to 4096 | Medium: affects cost and detail level |
| Guardrail config | Content filter thresholds | High: can block valid responses |
Version your policies explicitly. A policy is a named, immutable snapshot: policy_v2.4 = {prompt: "...", model: "claude-sonnet-3.5", temperature: 0.7, tools: [...]}. Never mutate a live policy. Create a new version and canary it.
Traffic splitting: deterministic and sticky
The traffic splitter must be deterministic (same user always gets the same policy during a canary) and hash-based (not random). If user A sees policy v2.4 on their first request and v2.3 on their second, you can't compare their experience meaningfully. Sticky routing ensures each user or session is consistently in the canary or control group.
import hashlib
def route_to_policy(user_id: str, canary_percent: int,
control_policy: str, canary_policy: str) -> str:
"""Deterministic hash-based routing for canary traffic."""
hash_val = int(hashlib.sha256(
f"{user_id}:canary_salt".encode()
).hexdigest(), 16) % 100
if hash_val < canary_percent:
return canary_policy
return control_policy
The hash salt (canary_salt) should change between canary experiments. Otherwise, the same users always end up in the canary group, biasing your results to a fixed subset of your user base. Rotate the salt with each new canary deployment.
I learned this the hard way when a team's canary kept "passing" despite known quality issues. It turned out the same 50 power users were always in the canary cohort, and they were sophisticated enough to work around the agent's limitations. Regular users, who never appeared in the canary, struggled with the new prompt.
Quality metrics for agent policies
Traditional software canaries monitor HTTP error rates and latency. Agent canaries need additional metrics because LLM quality regressions are "soft" (the agent returns a response, but the response is wrong or unhelpful).
The most important metric is task completion rate: the percentage of agent sessions that successfully complete the user's requested task. A prompt change that improves response quality by 10% but drops completion rate by 5% is a net negative. Completion rate is the north star because incomplete tasks generate user complaints regardless of response quality.
Cost per task catches regressions that aren't visible in quality metrics. A new prompt might produce equally good output but use 3x more tokens because it triggers longer reasoning chains. At scale, this cost increase can blow budgets without any visible quality change.
Automatic rollback triggers
Define rollback thresholds before the canary starts. Don't decide thresholds after seeing the data (that's p-hacking). Each metric has a threshold that, when breached, triggers automatic rollback.
| Metric | Rollback Threshold | Rationale |
|---|---|---|
| Task completion rate | Drops >5% vs. control | Core user value metric |
| Tool call error rate | Increases >3x vs. control | Indicates broken tool integration |
| Cost per task | Increases >20% vs. control | Budget protection |
| Latency P95 | Increases >50% vs. control | User experience degradation |
| Guardrail trigger rate | Increases >3x vs. control | New policy may be too restrictive |
| User abandonment rate | Increases >10% vs. control | Users giving up on the agent |
The rollback decision checks all thresholds simultaneously. If any single metric breaches its threshold, roll back immediately. Don't wait for multiple metrics to fail; one broken metric is enough to indicate a problem. Better to roll back a good change prematurely (you can re-deploy it) than to let a bad change degrade production.
def check_rollback(canary_metrics: dict, control_metrics: dict,
thresholds: dict) -> tuple[bool, str]:
"""Check if canary should be rolled back."""
for metric, threshold in thresholds.items():
canary_val = canary_metrics[metric]
control_val = control_metrics[metric]
if metric in ("completion_rate",):
# Lower is worse for completion rate
if control_val - canary_val > threshold:
return True, f"{metric} dropped by {control_val - canary_val:.1%}"
else:
# Higher is worse for error, cost, latency
if control_val > 0 and canary_val / control_val > (1 + threshold):
return True, f"{metric} increased by {canary_val/control_val - 1:.1%}"
return False, "All metrics within thresholds"
Progressive promotion pipeline
Don't jump from 1% to 100%. Progressive promotion increases canary traffic in stages, with metric checks at each gate. This catches regressions that only appear at higher traffic volumes (concurrency issues, rate limits, cache effects).
Continue Reading with Premium
Unlock this article and every other in-depth system design guide on the platform with NotesFromSDE Premium.