Incident runbook design
How to design effective incident runbooks: playbook structure, the detection-to-resolution pipeline, severity levels, on-call rotations, escalation paths, and blameless postmortem practices.
Why Runbooks Matter
An incident at 3 AM is the worst time to figure out how your system works. Runbooks encode the knowledge of your most experienced engineers into step-by-step instructions that anyone on rotation can follow under stress without heroics.
A good runbook is not a manual; it's a decision tree. "If you see X, try Y. If Y doesn't solve it in 10 minutes, escalate to Z."
Incident Lifecycle
Every incident follows the same pipeline:
Detection -> Triage -> Mitigation -> Resolution -> Postmortem
Detection: alert fires, user report, on-call notified
Triage: assess severity, communicate to stakeholders, start incident channel
Mitigation: stop the bleeding (rollback, kill switch, scale up, redirect traffic)
Resolution: fix root cause, validate fix under production load
Postmortem: blameless review, action items assigned with owners and due dates
Detection to mitigation is the critical window. Optimize for this. Root cause investigation can happen after service is restored.
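The linear pipeline above can be sketched as a small state machine; the stage names are from the list, and the helper is a minimal illustration, not a full incident-management implementation:

```python
from enum import Enum
from typing import Optional

class Stage(Enum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POSTMORTEM = 5

def next_stage(current: Stage) -> Optional[Stage]:
    """Advance one step in the incident pipeline. Stages are strictly
    sequential: an incident never skips triage or jumps backward.
    Returns None after POSTMORTEM, when the incident is closed."""
    members = list(Stage)
    idx = members.index(current)
    return members[idx + 1] if idx + 1 < len(members) else None
```

Modeling the stages explicitly makes it easy to enforce that, for example, a postmortem cannot be opened before resolution.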
Severity Levels
Clear severity levels prevent both over-escalation (waking up the CTO for a minor UI bug) and under-escalation (spending an hour debugging alone while customers can't check out):
| Severity | Definition | Response Time | Who |
|---|---|---|---|
| SEV1 | Complete outage or major data loss | Immediate, all hands | On-call + leadership + comms |
| SEV2 | Major feature broken, 10%+ users affected | 15 minutes | On-call + team lead |
| SEV3 | Degraded performance, workaround available | 1 hour | On-call |
| SEV4 | Minor bug, single user, no data loss | Next business day | Team backlog |
Declare severity early and adjust downward as the picture becomes clearer. It's always better to escalate and stand down than to under-escalate and miss impact.
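The table's criteria can be encoded as a simple classifier so triage is mechanical under pressure. This is a sketch using the thresholds from the table; tune the signals and cutoffs for your own product:

```python
def classify_severity(full_outage: bool, major_data_loss: bool,
                      pct_users_affected: float,
                      degraded_performance: bool) -> str:
    """Map impact signals to a severity level per the table above."""
    # SEV1: complete outage or major data loss
    if full_outage or major_data_loss:
        return "SEV1"
    # SEV2: major feature broken, 10%+ of users affected
    if pct_users_affected >= 10:
        return "SEV2"
    # SEV3: degraded performance, workaround available
    if degraded_performance:
        return "SEV3"
    # SEV4: minor bug, single user, no data loss
    return "SEV4"
```

Checking conditions in strict order from most to least severe means ambiguous incidents default upward, matching the "declare high, adjust down" rule.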
On-Call Rotation Design
Primary/secondary model: Primary is alerted first and handles the incident. Secondary is paged if the primary doesn't acknowledge within 5 minutes, and handles escalation.
Rotation structure: One-week rotations are standard. Shorter rotations spread load but increase context-switching cost. Longer rotations cause burnout. Overlap handoffs by 30 minutes.
Fair load: Track alert volume per rotation. If one rotation week consistently receives more alerts, that's a reliability problem: alert fatigue leads to ignored alerts, which leads to actual outages.
On-call compensation: Engineers should be compensated for on-call hours, especially when incidents occur. Uncompensated on-call is unsustainable and drives engineers away from reliability work.
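The primary/secondary acknowledgment rule above can be sketched as a paging decision; the 5-minute timeout is the one stated in the text, and the names are placeholders:

```python
from dataclasses import dataclass

ACK_TIMEOUT_SECONDS = 5 * 60  # secondary is paged if primary hasn't acked

@dataclass
class Page:
    primary: str
    secondary: str
    acked_by: str = ""  # empty until someone acknowledges

def who_to_page(page: Page, seconds_since_alert: int) -> str:
    """Decide who should be paged right now under the
    primary/secondary model."""
    if page.acked_by:
        return page.acked_by  # incident is owned; stop escalating
    if seconds_since_alert < ACK_TIMEOUT_SECONDS:
        return page.primary
    return page.secondary
```

Real paging tools (PagerDuty, Opsgenie) implement this escalation policy natively; the point of the sketch is only to show the decision order.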
Runbook Structure
Each runbook covers one alert or failure mode:
# Alert: database_primary_high_latency
## What This Alert Means
Primary write latency > 500ms for 5 minutes. Normal is < 80ms.
Potential causes: long-running queries, lock contention, disk I/O saturation.
## Immediate Actions (first 5 minutes)
1. Check: is this a spike or sustained? -> Grafana dashboard link
2. Check: are connections maxed out? -> `SELECT count(*) FROM pg_stat_activity` or Datadog link
3. Check: what changed in last 30 minutes? -> PagerDuty change feed link
## Diagnosis Steps
If connections are maxed:
-> Identify long-running queries: `SELECT pid, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '30 seconds'`
-> Kill blocking queries if safe (no payment processing)
-> Scale connection pool: [link to runbook: scale db connection pool]
If disk I/O saturation:
-> Check disk metrics: [link]
-> If Amazon RDS: trigger IOPS burst if not already active
-> Consider read replica promotion for reads: [link to runbook]
If no obvious cause:
-> Escalate to database on-call: @db-team
## Mitigation Options
- Enable read-only mode for non-critical paths: [feature flag link]
- Rollback last deployment: [deployment rollback link]
- Promote read replica: [runbook link]
## Escalation
If not resolved in 20 minutes: escalate to SEV2, add @db-team and @platform-lead
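The "kill blocking queries if safe (no payment processing)" step in the runbook above can be made mechanical with a small guard. The protected keywords here are hypothetical; substitute table names from your own schema:

```python
def safe_to_kill(query: str,
                 protected_keywords=("payment", "checkout")) -> bool:
    """Return False if the query text touches anything payment-related.
    Keyword matching is a coarse heuristic, not a substitute for judgment."""
    q = query.lower()
    return not any(keyword in q for keyword in protected_keywords)

def terminate_query_sql(pid: int) -> str:
    """Build the Postgres statement to kill one backend by pid.
    pg_terminate_backend is the standard way to stop a runaway query."""
    return f"SELECT pg_terminate_backend({int(pid)});"
```

Generating the statement rather than running it directly lets the on-call engineer paste it into a session and eyeball it first, which matters at 3 AM.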
Communication During Incidents
Dedicated incident Slack channel, created by incident commander on declaration. Every 15-30 minutes, post a status update even if nothing has changed:
[14:45 UTC] - Investigating: error rate at 12%, source unknown. Working to identify.
[15:00 UTC] - Root cause found: connection pool exhausted after deploy at 14:30.
Rolling back deployment now. ETA resolution: 10 minutes.
[15:12 UTC] - Rollback complete. Error rate back to baseline. Monitoring.
[15:30 UTC] - Incident resolved. Setting up postmortem for tomorrow 10am.
Never go silent to stakeholders. "I don't know yet but I'm working on it" is a valid update.
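Status updates in the format shown above can be generated consistently with a tiny helper, so every responder posts timestamps the same way. This is a formatting sketch only; posting to the incident channel is left to your chat tooling:

```python
from datetime import datetime, timezone
from typing import Optional

def status_update(message: str, now: Optional[datetime] = None) -> str:
    """Format an incident-timeline entry as [HH:MM UTC] - message,
    matching the update style shown above. Always uses UTC so timelines
    from different responders line up."""
    now = now or datetime.now(timezone.utc)
    return f"[{now:%H:%M} UTC] - {message}"
```

Pinning updates to UTC avoids the classic postmortem problem of reconciling timestamps across responders in different time zones.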
Blameless Postmortem
A postmortem's goal is to prevent the incident from happening again, not to assign fault. If engineers fear blame, they hide information that would help the investigation, and you repeat the incident.
Postmortem structure:
- Timeline: minute-by-minute record of what happened, when it was detected, and what was done
- Root cause: what changed, what failed, why
- Contributing factors: what made the system fragile (lack of alerts, no canary deployment, missing circuit breaker)
- Impact: duration, affected users, revenue estimate
- Action items: each with an owner and a due date
- What went well: things that helped contain or detect the incident faster
Action items without owners and due dates are not action items. Track them in the same system as engineering tickets.
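The rule that an action item must have an owner and a due date can be enforced at creation time rather than by convention. A minimal sketch; a real tracker would sync these into your ticketing system:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]

    def __post_init__(self):
        # "Action items without owners and due dates are not action items."
        if not self.owner:
            raise ValueError("action item needs an owner")
        if self.due is None:
            raise ValueError("action item needs a due date")
```

Rejecting ownerless items at construction means a postmortem document can't accumulate vague "someone should fix this" entries that nobody tracks.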
Quick Recap
- Runbooks encode institutional knowledge into step-by-step decision trees. Engineers under stress at 3 AM execute better with clear steps than from memory. Every alert should have a runbook.
- Severity levels prevent escalation mistakes. Declare high and adjust down rather than under-declare. Clear criteria (% of users affected, specific functionality broken) remove ambiguity under pressure.
- One-week rotations with primary/secondary roles are standard. Track alert volume per rotation week: alert fatigue leading to ignored alerts is a reliability risk in itself.
- During an incident, optimize for mitigation speed, not root cause investigation. Rollback first, understand why later. Status updates every 15-30 minutes to stakeholders even when there is nothing new to report.
- Blameless postmortems improve system resilience over time. Blame causes hiding, hiding causes recurrence. Every postmortem action item needs an owner and a due date or it won't happen.