On-call at 2am with an unknown alert
The systematic approach to a live production incident: the exact sequence of steps to take when you're paged and don't know what's wrong, with communication templates.
The First 5 Minutes
You've been woken up by a PagerDuty alert. 2am. The alert is PaymentService_HighErrorRate. You've been on-call for two weeks but haven't touched the payments service before.
Most engineers' instinct is to immediately start looking at logs, changing things, or waking up a senior engineer. The correct sequence is different.
Step 1 (first minute): Assess severity before anything else
Questions in this order:
1. What is the user impact? (Are users actively failing, or is this a metric threshold?)
2. What is the scope? (100% of traffic? One endpoint? One region?)
3. Is this getting worse, stable, or improving?
A p99 latency spike that's already trending down might resolve itself. An error rate at 100% requires immediate action. These need different responses.
Check: your monitoring dashboard, not just the alert. Alerts fire on thresholds; dashboards show trends.
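The triage above can be sketched as a tiny classifier. A minimal sketch, assuming you can pull a few recent error-rate samples (fraction of requests failing, oldest first) from your monitoring system's API; the thresholds and category names are illustrative, not policy:

```python
def triage(error_rates):
    """Classify severity from recent error-rate samples (0.0-1.0, oldest first).

    Hypothetical helper: in practice these samples come from your dashboard
    (Prometheus, Datadog, etc.), and the cutoffs are yours to tune.
    """
    current = error_rates[-1]
    trend = error_rates[-1] - error_rates[0]
    if current >= 0.99:
        return "act-now"       # total outage: immediate action
    if trend < 0 and current < 0.05:
        return "watch"         # already trending down; may self-resolve
    if trend > 0:
        return "escalating"    # getting worse: treat as urgent
    return "investigate"       # stable but elevated

print(triage([0.08, 0.05, 0.02]))  # spike trending down -> "watch"
print(triage([0.10, 0.25, 0.40]))  # growing -> "escalating"
```

Note that the same current value can land in different buckets depending on the trend, which is exactly why the dashboard matters more than the alert threshold.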
Step 2 (first 2 minutes): Post a status update before investigating
This feels backwards but it's critical. If you don't post anything, your manager, other on-call engineers, and downstream teams will start paging you.
Template for your incident Slack / PagerDuty note:
"[2:03 AM] Paged for PaymentService_HighErrorRate.
Currently assessing scope and impact.
Will update in 15 minutes or if I need to escalate."
This buys you investigation time and establishes that someone is engaged.
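As a sketch, the template can live as a formatter so the 2am update is a single call. The function name is an assumption, and the posting mechanism (Slack webhook, PagerDuty note) is deliberately left out:

```python
from datetime import datetime

def first_update(alert_name, next_update_minutes=15, now=None):
    """Fill the first-update template; how you post it is up to your team."""
    now = now or datetime.now()
    timestamp = now.strftime("%I:%M %p").lstrip("0")  # "02:03 AM" -> "2:03 AM"
    return (
        f"[{timestamp}] Paged for {alert_name}.\n"
        "Currently assessing scope and impact.\n"
        f"Will update in {next_update_minutes} minutes or if I need to escalate."
    )

print(first_update("PaymentService_HighErrorRate",
                   now=datetime(2024, 1, 1, 2, 3)))
```

The point of canning it is that the message goes out in seconds, before the investigation starts.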
Step 3 (next 8 minutes): The structured investigation
Work in a fixed order rather than jumping straight to logs: first confirm scope (which endpoints, regions, or customers are failing, and what fraction of traffic), then look for a recent change (deploys, config changes, feature flags in the last hour; finding the change is usually faster than debugging the symptom), then check service-specific signals (database latency, dependency error rates, resource saturation).
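A sketch of that order as data, assuming the three checks named in the recap at the end of this piece (scope, recent change, service-specific signals); the value is that the sequence is decided before the incident, not improvised during it:

```python
# Ordered checks for the 8-minute window. Earlier checks are cheaper and
# more often decisive: a recent deploy explains most sudden regressions.
CHECKS = [
    ("scope", "Which endpoints/regions/customers are failing, and what fraction?"),
    ("recent change", "Any deploy, config change, or feature flag flip in the last hour?"),
    ("service signals", "DB latency, dependency error rates, CPU/memory saturation?"),
]

def first_lead(findings):
    """Walk the checks in order; return the first one with a concrete finding.

    `findings` maps a check name to what you observed (or None). Hypothetical
    shape: in a real incident this is your own notes, not a data structure.
    """
    for name, question in CHECKS:
        if findings.get(name):
            return name, findings[name]
    return None, "no lead yet; the escalation clock is running"
```

The fall-through case is not a failure state, it is the input to Step 4.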
Step 4: Escalate early, not late
The instinct to "figure it out before waking anyone else up" is usually wrong. Escalate if:
- You've spent 15 minutes without a clear hypothesis
- The impact is growing, not contained
- The service is outside your expertise
- You need a second set of eyes on a risky fix
Escalation message template:
"Hey, I need eyes on PaymentService.
Error rate has been at 40% for 15 minutes.
No recent deploys.
I've ruled out DB latency (normal) and external rate limits (normal).
My next hypothesis is X but I'm not confident. Can you join a call?"
Good engineers escalate early. Stubborn independence during an active incident is a bug, not a feature.
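The four triggers above, written as a predicate. A sketch only: every argument here is a judgment call you make at the keyboard, not a metric you can query.

```python
def should_escalate(minutes_without_hypothesis, impact_growing,
                    outside_my_expertise, fix_is_risky):
    """Any one trigger is enough: escalation is OR, not AND."""
    return (
        minutes_without_hypothesis >= 15
        or impact_growing
        or outside_my_expertise
        or fix_is_risky
    )
```

Writing it as OR makes the anti-pattern visible: "I'll push through because only one of these is true" inverts the rule.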
The Fix Discipline
Two rules that prevent making things worse:
Rule 1: Never apply a fix you don't understand. "Restarting the service" is not a fix; it's a gamble that clearing state will clear the problem. If it works, you'll get paged again tonight, or next week, because the underlying cause is still there. If it makes things worse, you've made the incident harder to resolve.
Rule 2: Change one thing at a time. If you change the config, restart the service, and roll back the deploy simultaneously, and things improve, you don't know what fixed it. You've made the incident harder to learn from.
Before any fix:
"My hypothesis is X. This fix targets X.
If it works, I expect to see the error rate drop within 2 minutes.
If it doesn't work, I'll revert immediately."
Write this in the incident channel before you apply the fix. It forces clarity and creates an audit trail.
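A minimal sketch of that pre-fix note as a formatter, mirroring the template above; the function name and arguments are hypothetical:

```python
def pre_fix_note(hypothesis, expected_signal, window_minutes=2):
    """Format the pre-fix commitment for the incident channel.

    Writing it first forces the hypothesis -> expected signal -> revert
    sequence; the formatter just keeps the 2am version complete.
    """
    return (
        f"My hypothesis is {hypothesis}. This fix targets {hypothesis}.\n"
        f"If it works, I expect to see {expected_signal} "
        f"within {window_minutes} minutes.\n"
        "If it doesn't work, I'll revert immediately."
    )

print(pre_fix_note("connection-pool exhaustion", "the error rate drop"))
```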
The Post-Fix Checklist
After the error rate drops, you're not done:
- Update status: "Error rate back to normal at 2:37 AM. Root cause: [brief]"
- Watch for 10 minutes to confirm stability
- Document what you found and what you changed in the incident channel
- Assess: is there follow-up work needed during business hours?
- Write an incident summary if severity was high enough
- File a ticket for any quick fixes you applied that need proper solutions
The "it's resolved, go back to sleep" instinct is wrong. Fifteen minutes of documentation now saves hours of confusion tomorrow.
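The "watch for 10 minutes" item is the easiest one to cut short at 2:40am. As a sketch, with one error-rate sample per minute and a baseline that is an assumption, not a universal threshold:

```python
def confirmed_stable(samples, baseline=0.01, window_minutes=10):
    """True only once every sample in the trailing window is at/below baseline."""
    if len(samples) < window_minutes:
        return False  # not enough post-fix data yet; keep watching
    return all(s <= baseline for s in samples[-window_minutes:])
```

A single spike anywhere in the window resets the verdict, which is the behavior you want before declaring the incident over.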
In the Interview
When asked about being on-call or responding to incidents, the interviewer is evaluating:
- Systematic thinking: did you follow a process, or thrash?
- Communication: did you keep stakeholders informed?
- Escalation judgment: did you know when you were in over your head?
- Post-incident discipline: did you document and follow through?
The best answers include: the specific signals you looked at, the hypotheses you formed, the points where you chose to escalate and why, and what the post-incident work looked like.
Weak: "I got paged and eventually figured out the problem and fixed it."
Strong: "I got paged at 2am for elevated error rates on the payment service.
My first step was to check if errors were 100% of traffic or partial;
they were 40%, concentrated on a single endpoint. I found a deploy
that had happened 20 minutes before the alert fired, issued a rollback
after confirming the rollback was safe, and the error rate dropped
within 90 seconds."
Quick Recap
- Before investigating, assess severity: what's the user impact, what's the scope, and is it getting worse?
- Post a status update before you start debugging; it prevents escalation pile-on and establishes ownership.
- Investigate in order: scope → recent change → service-specific signals. Finding the change is usually faster than debugging the symptom.
- Escalate early: spending 15 minutes without a clear hypothesis is the escalation trigger, not the thing to push through alone.
- Document before and after every fix: hypothesis, expected outcome, result. The audit trail matters for the post-mortem.