Canary rollout for agent policies

TL;DR

An agent policy is the full configuration bundle: system prompt + model version + tool permissions + temperature + guardrail config. Changing any of these can cause subtle quality regressions that unit tests and evals don't catch.
Canary rollout deploys the new policy to 1-5% of production traffic and compares quality metrics (task completion rate, cost per task, error rate, latency P95) against the control group running the old policy.
Automatic rollback triggers when any metric degrades beyond a threshold: completion rate drops >5%, cost increases >20%, or guardrail trigger rate increases >3x.
Progressive promotion (1% -> 5% -> 25% -> 50% -> 100%) with metric gates at each stage catches regressions at every scale.
Production teams using canary rollout for prompt changes report catching 60-80% of regressions before full deployment, compared to near-zero with immediate rollout.
Limitation: statistical significance requires sample volume. At 1% traffic, low-traffic agents may need days to accumulate enough data for confident promotion or rollback decisions.

Your customer support agent runs on Claude Sonnet with a carefully tuned system prompt. The prompt engineering team rewrites the system prompt to improve tone and reduce hallucinations. They test it on 50 evaluation examples and see a 12% improvement in quality scores. They deploy to production at 3 PM on a Tuesday.

By 5 PM, the support queue is backed up. The new prompt handles standard billing inquiries better, but it completely breaks the refund workflow. The model now refuses to process refund requests, interpreting them as "adversarial" because the new guardrail instructions are too aggressive. The 50-example eval set didn't include a single refund scenario.

I've seen this exact failure mode three times across different teams. The pattern is always the same: a prompt change that improves the majority of cases while catastrophically breaking a minority use case. The eval set, no matter how large, never covers the full distribution of production traffic. The only way to catch these regressions is to test on real production traffic at a safe scale.

What Is It?

Canary rollout deploys a new agent policy to a small percentage of production traffic (1-5%) while the remaining traffic continues using the current policy. The system compares real-time quality metrics between the canary (new policy) and control (old policy) groups. If the canary's metrics degrade beyond configurable thresholds, the system automatically rolls back to the old policy. If metrics hold steady or improve, the canary is progressively promoted to larger traffic percentages.

Think of it as a food taster for a medieval king. Before the king eats the full meal, one person takes a small bite. If the taster is fine after 30 minutes, the meal is safe. If the taster gets sick, the king doesn't eat. The 1% canary traffic is your taster. You sacrifice a small number of requests to potential quality issues in exchange for protecting the other 99%.

How It Works

Defining an agent policy

An agent "policy" is not just the system prompt. It's the complete configuration that determines agent behavior. Changing any single element can produce different outputs, and combinations of changes compound unpredictably.

Policy Element	Example	Risk of Change
System prompt	Tone, instructions, guardrails	High: affects every response
Model version	Claude Sonnet 3.5 to 3.6	High: different capabilities and quirks
Temperature	0.7 to 0.3	Medium: changes output diversity
Tool permissions	Add/remove available tools	High: enables/disables entire capabilities
Max tokens	2048 to 4096	Medium: affects cost and detail level
Guardrail config	Content filter thresholds	High: can block valid responses

Version your policies explicitly. A policy is a named, immutable snapshot: policy_v2.4 = {prompt: "...", model: "claude-sonnet-3.5", temperature: 0.7, tools: [...]}. Never mutate a live policy. Create a new version and canary it.

Traffic splitting: deterministic and sticky

The traffic splitter must be deterministic (same user always gets the same policy during a canary) and hash-based (not random). If user A sees policy v2.4 on their first request and v2.3 on their second, you can't compare their experience meaningfully. Sticky routing ensures each user or session is consistently in the canary or control group.

import hashlib

def route_to_policy(user_id: str, canary_percent: int,
                    control_policy: str, canary_policy: str) -> str:
    """Deterministic hash-based routing for canary traffic."""
    hash_val = int(hashlib.sha256(
        f"{user_id}:canary_salt".encode()
    ).hexdigest(), 16) % 100

    if hash_val < canary_percent:
        return canary_policy
    return control_policy

The hash salt (canary_salt) should change between canary experiments. Otherwise, the same users always end up in the canary group, biasing your results to a fixed subset of your user base. Rotate the salt with each new canary deployment.

I learned this the hard way when a team's canary kept "passing" despite known quality issues. It turned out the same 50 power users were always in the canary cohort, and they were sophisticated enough to work around the agent's limitations. Regular users, who never appeared in the canary, struggled with the new prompt.

Quality metrics for agent policies

Traditional software canaries monitor HTTP error rates and latency. Agent canaries need additional metrics because LLM quality regressions are "soft" (the agent returns a response, but the response is wrong or unhelpful).

The most important metric is task completion rate: the percentage of agent sessions that successfully complete the user's requested task. A prompt change that improves response quality by 10% but drops completion rate by 5% is a net negative. Completion rate is the north star because incomplete tasks generate user complaints regardless of response quality.

Cost per task catches regressions that aren't visible in quality metrics. A new prompt might produce equally good output but use 3x more tokens because it triggers longer reasoning chains. At scale, this cost increase can blow budgets without any visible quality change.

Automatic rollback triggers

Define rollback thresholds before the canary starts. Don't decide thresholds after seeing the data (that's p-hacking). Each metric has a threshold that, when breached, triggers automatic rollback.

Metric	Rollback Threshold	Rationale
Task completion rate	Drops >5% vs. control	Core user value metric
Tool call error rate	Increases >3x vs. control	Indicates broken tool integration
Cost per task	Increases >20% vs. control	Budget protection
Latency P95	Increases >50% vs. control	User experience degradation
Guardrail trigger rate	Increases >3x vs. control	New policy may be too restrictive
User abandonment rate	Increases >10% vs. control	Users giving up on the agent

The rollback decision checks all thresholds simultaneously. If any single metric breaches its threshold, roll back immediately. Don't wait for multiple metrics to fail; one broken metric is enough to indicate a problem. Better to roll back a good change prematurely (you can re-deploy it) than to let a bad change degrade production.

def check_rollback(canary_metrics: dict, control_metrics: dict,
                   thresholds: dict) -> tuple[bool, str]:
    """Check if canary should be rolled back."""
    for metric, threshold in thresholds.items():
        canary_val = canary_metrics[metric]
        control_val = control_metrics[metric]

        if metric in ("completion_rate",):
            # Lower is worse for completion rate
            if control_val - canary_val > threshold:
                return True, f"{metric} dropped by {control_val - canary_val:.1%}"
        else:
            # Higher is worse for error, cost, latency
            if control_val > 0 and canary_val / control_val > (1 + threshold):
                return True, f"{metric} increased by {canary_val/control_val - 1:.1%}"
    return False, "All metrics within thresholds"

Progressive promotion pipeline

Don't jump from 1% to 100%. Progressive promotion increases canary traffic in stages, with metric checks at each gate. This catches regressions that only appear at higher traffic volumes (concurrency issues, rate limits, cache effects).

TL;DR

An agent policy is the full configuration bundle: system prompt + model version + tool permissions + temperature + guardrail config. Changing any of these can cause subtle quality regressions that unit tests and evals don't catch.
Canary rollout deploys the new policy to 1-5% of production traffic and compares quality metrics (task completion rate, cost per task, error rate, latency P95) against the control group running the old policy.
Automatic rollback triggers when any metric degrades beyond a threshold: completion rate drops >5%, cost increases >20%, or guardrail trigger rate increases >3x.
Progressive promotion (1% -> 5% -> 25% -> 50% -> 100%) with metric gates at each stage catches regressions at every scale.
Production teams using canary rollout for prompt changes report catching 60-80% of regressions before full deployment, compared to near-zero with immediate rollout.
Limitation: statistical significance requires sample volume. At 1% traffic, low-traffic agents may need days to accumulate enough data for confident promotion or rollback decisions.

Policy Element	Example	Risk of Change
System prompt	Tone, instructions, guardrails	High: affects every response
Model version	Claude Sonnet 3.5 to 3.6	High: different capabilities and quirks
Temperature	0.7 to 0.3	Medium: changes output diversity
Tool permissions	Add/remove available tools	High: enables/disables entire capabilities
Max tokens	2048 to 4096	Medium: affects cost and detail level
Guardrail config	Content filter thresholds	High: can block valid responses

Traffic splitting: deterministic and sticky

import hashlib

def route_to_policy(user_id: str, canary_percent: int,
                    control_policy: str, canary_policy: str) -> str:
    """Deterministic hash-based routing for canary traffic."""
    hash_val = int(hashlib.sha256(
        f"{user_id}:canary_salt".encode()
    ).hexdigest(), 16) % 100

    if hash_val < canary_percent:
        return canary_policy
    return control_policy

Quality metrics for agent policies

Automatic rollback triggers

Define rollback thresholds before the canary starts. Don't decide thresholds after seeing the data (that's p-hacking). Each metric has a threshold that, when breached, triggers automatic rollback.

Metric	Rollback Threshold	Rationale
Task completion rate	Drops >5% vs. control	Core user value metric
Tool call error rate	Increases >3x vs. control	Indicates broken tool integration
Cost per task	Increases >20% vs. control	Budget protection
Latency P95	Increases >50% vs. control	User experience degradation
Guardrail trigger rate	Increases >3x vs. control	New policy may be too restrictive
User abandonment rate	Increases >10% vs. control	Users giving up on the agent

def check_rollback(canary_metrics: dict, control_metrics: dict,
                   thresholds: dict) -> tuple[bool, str]:
    """Check if canary should be rolled back."""
    for metric, threshold in thresholds.items():
        canary_val = canary_metrics[metric]
        control_val = control_metrics[metric]

        if metric in ("completion_rate",):
            # Lower is worse for completion rate
            if control_val - canary_val > threshold:
                return True, f"{metric} dropped by {control_val - canary_val:.1%}"
        else:
            # Higher is worse for error, cost, latency
            if control_val > 0 and canary_val / control_val > (1 + threshold):
                return True, f"{metric} increased by {canary_val/control_val - 1:.1%}"
    return False, "All metrics within thresholds"

Canary rollout for agent policies

TL;DR

The Problem It Solves

What Is It?

How It Works

Defining an agent policy

Traffic splitting: deterministic and sticky

Quality metrics for agent policies

Automatic rollback triggers

Progressive promotion pipeline

Continue Reading with Premium

Comments

Canary rollout for agent policies

TL;DR

The Problem It Solves

What Is It?

How It Works

Defining an agent policy

Traffic splitting: deterministic and sticky

Quality metrics for agent policies

Automatic rollback triggers

Progressive promotion pipeline

Continue Reading with Premium

Comments