Incident runbook design
How to design effective incident runbooks: playbook structure, the detection-to-resolution pipeline, severity levels, on-call rotations, escalation paths, and blameless postmortem practices.
Why Runbooks Matter
An incident at 3 AM is the worst time to figure out how your system works. Runbooks encode the knowledge of your most experienced engineers into step-by-step instructions that anyone on rotation can follow under stress without heroics.
A good runbook is not a manual; it's a decision tree. "If you see X, try Y. If Y doesn't solve it in 10 minutes, escalate to Z."
Incident Lifecycle
Every incident follows the same pipeline:
Detection -> Triage -> Mitigation -> Resolution -> Postmortem
Detection: alert fires, user report, on-call notified
Triage: assess severity, communicate to stakeholders, start incident channel
Mitigation: stop the bleeding (rollback, kill switch, scale up, redirect traffic)
Resolution: fix root cause, validate fix under production load
Postmortem: blameless review, action items assigned with owners and due dates
Detection to mitigation is the critical window. Optimize for this. Root cause investigation can happen after service is restored.
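The linear pipeline above can be sketched as a small state machine; the stage names are from the list, and the helper is a minimal illustration, not a full incident-management implementation:

```python
from enum import Enum
from typing import Optional

class Stage(Enum):
    DETECTION = 1
    TRIAGE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POSTMORTEM = 5

def next_stage(current: Stage) -> Optional[Stage]:
    """Advance one step in the incident pipeline. Stages are strictly
    sequential: an incident never skips triage or jumps backward.
    Returns None after POSTMORTEM, when the incident is closed."""
    members = list(Stage)
    idx = members.index(current)
    return members[idx + 1] if idx + 1 < len(members) else None
```

Modeling the stages explicitly makes it easy to enforce that, for example, a postmortem cannot be opened before resolution.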
Severity Levels
Clear severity levels prevent both over-escalation (waking up the CTO for a minor UI bug) and under-escalation (spending an hour debugging alone while customers can't check out):
| Severity | Definition | Response Time | Who |
|---|---|---|---|
| SEV1 | Complete outage or major data loss | Immediate, all hands | On-call + leadership + comms |
| SEV2 | Major feature broken, 10%+ users affected | 15 minutes | On-call + team lead |
| SEV3 | Degraded performance, workaround available | 1 hour | On-call |
| SEV4 | Minor bug, single user, no data loss | Next business day | Team backlog |
Declare severity early and adjust downward as the picture becomes clearer. It's always better to escalate and stand down than to under-escalate and miss impact.
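The table's criteria can be encoded as a simple classifier so triage is mechanical under pressure. This is a sketch using the thresholds from the table; tune the signals and cutoffs for your own product:

```python
def classify_severity(full_outage: bool, major_data_loss: bool,
                      pct_users_affected: float,
                      degraded_performance: bool) -> str:
    """Map impact signals to a severity level per the table above."""
    # SEV1: complete outage or major data loss
    if full_outage or major_data_loss:
        return "SEV1"
    # SEV2: major feature broken, 10%+ of users affected
    if pct_users_affected >= 10:
        return "SEV2"
    # SEV3: degraded performance, workaround available
    if degraded_performance:
        return "SEV3"
    # SEV4: minor bug, single user, no data loss
    return "SEV4"
```

Checking conditions in strict order from most to least severe means ambiguous incidents default upward, matching the "declare high, adjust down" rule.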
On-Call Rotation Design
Primary/secondary model: Primary is alerted first and handles the incident. Secondary is paged if the primary doesn't acknowledge within 5 minutes, and handles escalation.
Rotation structure: One-week rotations are standard. Shorter rotations spread load but increase context-switching cost. Longer rotations cause burnout. Overlap handoffs by 30 minutes.
Fair load: Track alert volume per rotation. If one rotation week consistently receives more alerts, that's a reliability problem: alert fatigue leads to ignored alerts, which leads to actual outages.
On-call compensation: Engineers should be compensated for on-call hours, especially when incidents occur. Uncompensated on-call is unsustainable and drives engineers away from reliability work.
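The primary/secondary acknowledgment rule above can be sketched as a paging decision; the 5-minute timeout is the one stated in the text, and the names are placeholders:

```python
from dataclasses import dataclass

ACK_TIMEOUT_SECONDS = 5 * 60  # secondary is paged if primary hasn't acked

@dataclass
class Page:
    primary: str
    secondary: str
    acked_by: str = ""  # empty until someone acknowledges

def who_to_page(page: Page, seconds_since_alert: int) -> str:
    """Decide who should be paged right now under the
    primary/secondary model."""
    if page.acked_by:
        return page.acked_by  # incident is owned; stop escalating
    if seconds_since_alert < ACK_TIMEOUT_SECONDS:
        return page.primary
    return page.secondary
```

Real paging tools (PagerDuty, Opsgenie) implement this escalation policy natively; the point of the sketch is only to show the decision order.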
Runbook Structure
Each runbook covers one alert or failure mode:
# Alert: database_primary_high_latency
## What This Alert Means
Primary write latency > 500ms for 5 minutes. Normal is < 80ms.
Potential causes: long-running queries, lock contention, disk I/O saturation.
## Immediate Actions (first 5 minutes)
1. Check: is this a spike or sustained? -> Grafana dashboard link
2. Check: are connections maxed out? -> `SELECT count(*) FROM pg_stat_activity` or Datadog link
3. Check: what changed in last 30 minutes? -> PagerDuty change feed link
## Diagnosis Steps
If connections are maxed:
-> Identify long-running queries: `SELECT pid, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '30 seconds'`
-> Kill blocking queries if safe (no payment processing)
-> Scale connection pool: [link to runbook: scale db connection pool]
If disk I/O saturation:
-> Check disk metrics: [link]
-> If Amazon RDS: trigger IOPS burst if not already active
-> Consider read replica promotion for reads: [link to runbook]
If no obvious cause:
-> Escalate to database on-call: @db-team
## Mitigation Options
- Enable read-only mode for non-critical paths: [feature flag link]
- Rollback last deployment: [deployment rollback link]
- Promote read replica: [runbook link]
## Escalation
If not resolved in 20 minutes: escalate to SEV2, add @db-team and @platform-lead
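The "kill blocking queries if safe (no payment processing)" step in the runbook above can be made mechanical with a small guard. The protected keywords here are hypothetical; substitute table names from your own schema:

```python
def safe_to_kill(query: str,
                 protected_keywords=("payment", "checkout")) -> bool:
    """Return False if the query text touches anything payment-related.
    Keyword matching is a coarse heuristic, not a substitute for judgment."""
    q = query.lower()
    return not any(keyword in q for keyword in protected_keywords)

def terminate_query_sql(pid: int) -> str:
    """Build the Postgres statement to kill one backend by pid.
    pg_terminate_backend is the standard way to stop a runaway query."""
    return f"SELECT pg_terminate_backend({int(pid)});"
```

Generating the statement rather than running it directly lets the on-call engineer paste it into a session and eyeball it first, which matters at 3 AM.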
Communication During Incidents
Dedicated incident Slack channel, created by incident commander on declaration. Every 15-30 minutes, post a status update even if nothing has changed:
[14:45 UTC] - Investigating: error rate at 12%, source unknown. Working to identify.
[15:00 UTC] - Root cause found: connection pool exhausted after deploy at 14:30.
Rolling back deployment now. ETA resolution: 10 minutes.
[15:12 UTC] - Rollback complete. Error rate back to baseline. Monitoring.
[15:30 UTC] - Incident resolved. Setting up postmortem for tomorrow 10am.
Never go silent to stakeholders. "I don't know yet but I'm working on it" is a valid update.
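Status updates in the format shown above can be generated consistently with a tiny helper, so every responder posts timestamps the same way. This is a formatting sketch only; posting to the incident channel is left to your chat tooling:

```python
from datetime import datetime, timezone
from typing import Optional

def status_update(message: str, now: Optional[datetime] = None) -> str:
    """Format an incident-timeline entry as [HH:MM UTC] - message,
    matching the update style shown above. Always uses UTC so timelines
    from different responders line up."""
    now = now or datetime.now(timezone.utc)
    return f"[{now:%H:%M} UTC] - {message}"
```

Pinning updates to UTC avoids the classic postmortem problem of reconciling timestamps across responders in different time zones.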
Blameless Postmortem
A postmortem's goal is to prevent the incident from happening again, not to assign fault. If engineers fear blame, they hide information that would help the investigation, and you repeat the incident.
Postmortem structure:
- Timeline: minute-by-minute record of what happened, when it was detected, and what was done
- Root cause: what changed, what failed, why
- Contributing factors: what made the system fragile (lack of alerts, no canary deployment, missing circuit breaker)
- Impact: duration, affected users, revenue estimate
- Action items: each with an owner and a due date
- What went well: things that helped contain or detect the incident faster
Action items without owners and due dates are not action items. Track them in the same system as engineering tickets.
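The rule that an action item must have an owner and a due date can be enforced at creation time rather than by convention. A minimal sketch; a real tracker would sync these into your ticketing system:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]

    def __post_init__(self):
        # "Action items without owners and due dates are not action items."
        if not self.owner:
            raise ValueError("action item needs an owner")
        if self.due is None:
            raise ValueError("action item needs a due date")
```

Rejecting ownerless items at construction means a postmortem document can't accumulate vague "someone should fix this" entries that nobody tracks.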
Quick Recap
- Runbooks encode institutional knowledge into step-by-step decision trees. Engineers under stress at 3 AM execute better with clear steps than from memory. Every alert should have a runbook.
- Severity levels prevent escalation mistakes. Declare high and adjust down rather than under-declare. Clear criteria (% of users affected, specific functionality broken) remove ambiguity under pressure.
- One-week rotations with primary/secondary roles are standard. Track alert volume per rotation week: alert fatigue leading to ignored alerts is a reliability risk in itself.
- During an incident, optimize for mitigation speed, not root cause investigation. Rollback first, understand why later. Status updates every 15-30 minutes to stakeholders even when there is nothing new to report.
- Blameless postmortems improve system resilience over time. Blame causes hiding, hiding causes recurrence. Every postmortem action item needs an owner and a due date or it won't happen.