Taking ownership of an unmeasured system
What to do when you inherit a service with no metrics, no runbook, no documentation, and no one who fully understands it: a structured approach for building confidence in an unknown system.
The Scenario
You join a new team, or an existing service gets orphaned to your squad, and you're the owner. The system has been running for three years. There is no runbook. The APM dashboards exist but no one trusts them. The original author left eighteen months ago. When alerts fire, people guess.
This is one of the most common situations engineers face and one of the hardest to explain well in interviews.
What the Interviewer Is Evaluating
When interviewers ask "Tell me about inheriting a system you didn't understand," they're measuring:
- Pragmatism: Can you make incremental progress without needing complete information?
- Systematic thinking: Do you work through unknowns methodically, or randomly?
- Stakeholder management: Do you communicate uncertainty to users and leadership?
- Build vs. ask: Do you know when to dig in yourself vs. find the person who knows?
The Acquisition Framework
Phase 1: Map the blast radius (days 1-2)
Before doing anything else, understand what this service owns and what breaks if it fails:
Questions to answer:
- What does this service do? (user-facing? internal?)
- What calls it? (upstream dependencies)
- What does it call? (downstream dependencies)
- What would break if it went down right now?
- Is there an SLA on it?
You can get most of this from: service mesh topology, API gateway routing, deployment configs, and asking your team lead for 30 minutes.
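Once you have collected the dependencies from those sources, the blast radius is just a reachability question over the inverted call graph. A minimal sketch, assuming a hand-collected dependency map (all service names here are illustrative, not from any real system):

```python
from collections import deque

# Hypothetical dependency map, hand-collected from service mesh
# topology and deployment configs. Keys call the services in their lists.
CALLS = {
    "checkout-api":  ["payment-svc", "inventory-svc"],
    "payment-svc":   ["ledger-db", "fraud-svc"],
    "inventory-svc": ["inventory-db"],
    "fraud-svc":     [],
}

def blast_radius(service: str) -> set[str]:
    """Everything that transitively depends on `service` going down."""
    # Invert the call graph: for each service, who calls it?
    callers: dict[str, list[str]] = {}
    for src, dsts in CALLS.items():
        for dst in dsts:
            callers.setdefault(dst, []).append(src)
    # BFS upward through the callers.
    seen: set[str] = set()
    queue = deque([service])
    while queue:
        svc = queue.popleft()
        for caller in callers.get(svc, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

print(sorted(blast_radius("ledger-db")))  # who breaks if the ledger DB dies
```

Even a toy map like this answers the "what breaks if it goes down right now?" question concretely, and it doubles as the first artifact in your documentation.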
Phase 2: Audit existing signals
Audit checklist:
- Is there a health check endpoint?
- Are there application-level metrics (not just infra metrics)?
- Are there logs? What log level? Structured or unstructured?
- Are there traces? Do they propagate across service boundaries?
- Are there alerts? Who gets paged? When were they last tuned?
- Is there a dashboard? Does it reflect current architecture?
This tells you what you have to work with before you build anything new.
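It helps to record the audit as data rather than prose, so the gaps fall out mechanically. A minimal sketch, with hypothetical answers filled in by hand as you work through the checklist:

```python
# Hypothetical audit results for the inherited service; fill these in
# by hand as you answer each checklist question.
AUDIT = {
    "health check endpoint":     True,
    "application-level metrics": False,
    "structured logs":           True,
    "distributed traces":        False,
    "tuned alerts":              False,
    "up-to-date dashboard":      False,
}

def gap_report(audit: dict[str, bool]) -> list[str]:
    """Return the signals that are missing, i.e. what to build first."""
    return [signal for signal, present in audit.items() if not present]

for gap in gap_report(AUDIT):
    print(f"MISSING: {gap}")
```

The output of the gap report becomes the input to Phase 4's prioritization.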
Phase 3: Run through the failure scenarios manually
For each critical path your service owns (e.g., "process a payment"), walk through:
"If this step fails, what happens?
Is there a fallback? Is it tested?
Is the failure observable? Would I see it in monitoring?
Is there retry logic? Is it bounded?"
This is a live documentation exercise. Write it down as you go. You now have the beginnings of a runbook.
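One way to keep the write-up honest is to capture each walkthrough answer in a fixed shape and generate the runbook entry from it, so missing fallbacks and invisible failures can't be glossed over. A sketch, with illustrative field names and an invented example failure mode:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureMode:
    """One answer set from the walkthrough questions above."""
    step: str
    on_failure: str            # what actually happens when this step fails
    fallback: Optional[str]    # None means no fallback exists
    observable: bool           # would monitoring show it?
    retry_bounded: bool        # is retry logic present AND bounded?

def runbook_entry(fm: FailureMode) -> str:
    """Render one runbook section, flagging the gaps loudly."""
    lines = [f"## {fm.step}", f"On failure: {fm.on_failure}"]
    lines.append(f"Fallback: {fm.fallback or 'NONE - gap'}")
    if not fm.observable:
        lines.append("WARNING: failure is invisible in monitoring")
    if not fm.retry_bounded:
        lines.append("WARNING: retries missing or unbounded")
    return "\n".join(lines)

# Hypothetical example from a payment path walkthrough.
entry = FailureMode(
    step="Charge the card",
    on_failure="Order stays in PENDING forever",
    fallback=None,
    observable=False,
    retry_bounded=True,
)
print(runbook_entry(entry))
```

Each `WARNING` line in the rendered entry is simultaneously a runbook note and a backlog item for Phase 4.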
Phase 4: Add coverage to the dark spots
Now you know which failure modes are unobservable. Add metrics/logs in priority order:
Priority 1 (add this week):
- Error rate on critical paths
- p99 latency on user-facing endpoints
- Queue depth if async
Priority 2 (add this month):
- Business metric (orders processed / payments authorized)
- Resource saturation (connection pool utilization, thread pool)
Priority 3 (nice to have):
- Detailed per-operation traces
- Cold start metrics
- Dependency health
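To make the Priority 1 signals concrete, here is a bare-bones in-process sketch of the first two, error rate and p99 latency, using only the standard library. This is a placeholder for illustration; in production you would use your APM or a metrics client (Prometheus, StatsD, etc.) rather than hand-rolling this:

```python
import statistics

class PathMetrics:
    """Toy per-path metrics: error rate and p99 latency.

    Illustrative only; a real service would export these through
    its metrics library instead of keeping them in memory.
    """

    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0
        self.latencies_ms: list = []

    def record(self, latency_ms: float, ok: bool) -> None:
        self.requests += 1
        self.errors += 0 if ok else 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def p99_ms(self) -> float:
        # quantiles(n=100) returns the 1st..99th percentile cut points.
        return statistics.quantiles(self.latencies_ms, n=100)[98]

# Simulate 100 requests on a critical path, 4 of which fail.
m = PathMetrics()
for i in range(100):
    m.record(latency_ms=10.0 + i, ok=(i % 25 != 0))
print(m.error_rate())  # 0.04
```

Even this much, wired to the critical paths found in Phase 3, turns "we guess when alerts fire" into "we can see the error rate move."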
The Stakeholder Communication
Inheriting an unmeasured system creates a trust problem: you don't know what you don't know, which means you can't make promises about reliability. Say so explicitly:
"I'm in the process of documenting this service's failure modes and
adding observability. Right now I can't commit to a specific uptime
target because I don't have full visibility into the system.
I'll have a clearer picture in two weeks and can give you a more
specific answer then."
This is uncomfortable but correct. The alternative, making a commitment you can't back up, is worse.
The Anti-Patterns
Immediately rewriting it. An unmeasured system you don't yet understand is not a good candidate for a rewrite. You don't know which behavior is intentional and which is accidental.
Spending weeks on observability before going on call. You can't wait for perfect visibility before owning the service. Build incrementally while starting to respond to incidents.
Treating the lack of documentation as someone else's failure. Correct observation, but not useful. The system is yours to document now.
Getting confident too fast. Three weeks in, the system seems fine. Four weeks in, a failure mode you didn't know about hits. Maintain healthy uncertainty.
The Story Structure
Context (30s): What was the system? What state was it in when you took over?
Initial investigation (1 min): What did you find? What didn't exist?
Your approach (2 min): Walk through the phases: how did you systematically build understanding? What did you prioritize and why?
First real incident (1 min): What happened when you had to respond to an alert before you fully understood the system? What did you do?
Current state / outcome (30s): What does the system look like now? What documentation, runbooks, dashboards exist?
Quick Recap
- Before adding anything, map the blast radius: understand what breaks if this service fails.
- Audit what observable signals already exist before building new ones.
- Walk the critical paths manually as a failure exercise; it produces a runbook and identifies gaps simultaneously.
- Prioritize observability by: error rate > latency > business metric > resource saturation > detailed tracing.
- Communicate uncertainty to stakeholders early; premature confidence commitments are worse than honest uncertainty.