Taking ownership of an unmeasured system
What to do when you inherit a service with no metrics, no runbook, no documentation, and no one who fully understands it: a structured approach for building confidence in an unknown system.
The Scenario
You join a new team, or an existing service gets orphaned to your squad, and you're the owner. The system has been running for three years. There is no runbook. The APM dashboards exist but no one trusts them. The original author left eighteen months ago. When alerts fire, people guess.
This is one of the most common situations engineers face and one of the hardest to explain well in interviews.
What the Interviewer Is Evaluating
When interviewers ask "Tell me about inheriting a system you didn't understand," they're measuring:
- Pragmatism: Can you make incremental progress without needing complete information?
- Systematic thinking: Do you work through unknowns methodically, or randomly?
- Stakeholder management: Do you communicate uncertainty to users and leadership?
- Build vs. ask: Do you know when to dig in yourself vs. find the person who knows?
The Acquisition Framework
Phase 1: Map the blast radius (days 1-2)
Before doing anything else, understand what this service owns and what breaks if it fails:
Questions to answer:
- What does this service do? (user-facing? internal?)
- What calls it? (upstream dependencies)
- What does it call? (downstream dependencies)
- What would break if it went down right now?
- Is there an SLA on it?
You can get most of this from: service mesh topology, API gateway routing, deployment configs, and asking your team lead for 30 minutes.
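Once you have collected the dependencies from those sources, the blast radius is just a reachability question over the inverted call graph. A minimal sketch, assuming a hand-collected dependency map (all service names here are illustrative, not from any real system):

```python
from collections import deque

# Hypothetical dependency map, hand-collected from service mesh
# topology and deployment configs. Keys call the services in their lists.
CALLS = {
    "checkout-api":  ["payment-svc", "inventory-svc"],
    "payment-svc":   ["ledger-db", "fraud-svc"],
    "inventory-svc": ["inventory-db"],
    "fraud-svc":     [],
}

def blast_radius(service: str) -> set[str]:
    """Everything that transitively depends on `service` going down."""
    # Invert the call graph: for each service, who calls it?
    callers: dict[str, list[str]] = {}
    for src, dsts in CALLS.items():
        for dst in dsts:
            callers.setdefault(dst, []).append(src)
    # BFS upward through the callers.
    seen: set[str] = set()
    queue = deque([service])
    while queue:
        svc = queue.popleft()
        for caller in callers.get(svc, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

print(sorted(blast_radius("ledger-db")))  # who breaks if the ledger DB dies
```

Even a toy map like this answers the "what breaks if it goes down right now?" question concretely, and it doubles as the first artifact in your documentation.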
Phase 2: Audit existing signals
Audit checklist:
- Is there a health check endpoint?
- Are there application-level metrics (not just infra metrics)?
- Are there logs? What log level? Structured or unstructured?
- Are there traces? Do they propagate across service boundaries?
- Are there alerts? Who gets paged? When were they last tuned?
- Is there a dashboard? Does it reflect current architecture?
This tells you what you have to work with before you build anything new.
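It helps to record the audit as data rather than prose, so the gaps fall out mechanically. A minimal sketch, with hypothetical answers filled in by hand as you work through the checklist:

```python
# Hypothetical audit results for the inherited service; fill these in
# by hand as you answer each checklist question.
AUDIT = {
    "health check endpoint":     True,
    "application-level metrics": False,
    "structured logs":           True,
    "distributed traces":        False,
    "tuned alerts":              False,
    "up-to-date dashboard":      False,
}

def gap_report(audit: dict[str, bool]) -> list[str]:
    """Return the signals that are missing, i.e. what to build first."""
    return [signal for signal, present in audit.items() if not present]

for gap in gap_report(AUDIT):
    print(f"MISSING: {gap}")
```

The output of the gap report becomes the input to Phase 4's prioritization.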
Phase 3: Run through the failure scenarios manually
For each critical path your service owns (e.g., "process a payment"), walk through:
"If this step fails, what happens?
Is there a fallback? Is it tested?
Is the failure observable? Would I see it in monitoring?
Is there retry logic? Is it bounded?"
This is a live documentation exercise. Write it down as you go. You now have the beginnings of a runbook.
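One way to keep the write-up honest is to capture each walkthrough answer in a fixed shape and generate the runbook entry from it, so missing fallbacks and invisible failures can't be glossed over. A sketch, with illustrative field names and an invented example failure mode:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureMode:
    """One answer set from the walkthrough questions above."""
    step: str
    on_failure: str            # what actually happens when this step fails
    fallback: Optional[str]    # None means no fallback exists
    observable: bool           # would monitoring show it?
    retry_bounded: bool        # is retry logic present AND bounded?

def runbook_entry(fm: FailureMode) -> str:
    """Render one runbook section, flagging the gaps loudly."""
    lines = [f"## {fm.step}", f"On failure: {fm.on_failure}"]
    lines.append(f"Fallback: {fm.fallback or 'NONE - gap'}")
    if not fm.observable:
        lines.append("WARNING: failure is invisible in monitoring")
    if not fm.retry_bounded:
        lines.append("WARNING: retries missing or unbounded")
    return "\n".join(lines)

# Hypothetical example from a payment path walkthrough.
entry = FailureMode(
    step="Charge the card",
    on_failure="Order stays in PENDING forever",
    fallback=None,
    observable=False,
    retry_bounded=True,
)
print(runbook_entry(entry))
```

Each `WARNING` line in the rendered entry is simultaneously a runbook note and a backlog item for Phase 4.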
Phase 4: Add coverage to the dark spots
Now you know which failure modes are unobservable. Add metrics/logs in priority order:
Priority 1 (add this week):
- Error rate on critical paths
- p99 latency on user-facing endpoints
- Queue depth if async
Priority 2 (add this month):
- Business metric (orders processed / payments authorized)
- Resource saturation (connection pool utilization, thread pool)
Priority 3 (nice to have):
- Detailed per-operation traces
- Cold start metrics
- Dependency health
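To make the Priority 1 signals concrete, here is a bare-bones in-process sketch of the first two, error rate and p99 latency, using only the standard library. This is a placeholder for illustration; in production you would use your APM or a metrics client (Prometheus, StatsD, etc.) rather than hand-rolling this:

```python
import statistics

class PathMetrics:
    """Toy per-path metrics: error rate and p99 latency.

    Illustrative only; a real service would export these through
    its metrics library instead of keeping them in memory.
    """

    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0
        self.latencies_ms: list = []

    def record(self, latency_ms: float, ok: bool) -> None:
        self.requests += 1
        self.errors += 0 if ok else 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def p99_ms(self) -> float:
        # quantiles(n=100) returns the 1st..99th percentile cut points.
        return statistics.quantiles(self.latencies_ms, n=100)[98]

# Simulate 100 requests on a critical path, 4 of which fail.
m = PathMetrics()
for i in range(100):
    m.record(latency_ms=10.0 + i, ok=(i % 25 != 0))
print(m.error_rate())  # 0.04
```

Even this much, wired to the critical paths found in Phase 3, turns "we guess when alerts fire" into "we can see the error rate move."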
The Stakeholder Communication
Inheriting an unmeasured system creates a trust problem: you don't know what you don't know, which means you can't make promises about reliability. Say so explicitly:
"I'm in the process of documenting this service's failure modes and
adding observability. Right now I can't commit to a specific uptime
target because I don't have full visibility into the system.
I'll have a clearer picture in two weeks and can give you a more
specific answer then."
This is uncomfortable but correct. The alternative, making a commitment you can't back up, is worse.
The Anti-Patterns
Immediately rewriting it. An unmeasured system you don't yet understand is not a good candidate for a rewrite. You don't know which behavior is intentional and which is accidental.
Spending weeks on observability before going on call. You can't wait for perfect visibility before owning the service. Build incrementally while starting to respond to incidents.
Treating the lack of documentation as someone else's failure. Correct observation, but not useful. The system is yours to document now.
Getting confident too fast. Three weeks in, the system seems fine. Four weeks in, a failure mode you didn't know about hits. Maintain healthy uncertainty.
The Story Structure
Context (30s): What was the system? What state was it in when you took over?
Initial investigation (1 min): What did you find? What didn't exist?
Your approach (2 min): Walk through the phases: how did you systematically build understanding? What did you prioritize and why?
First real incident (1 min): What happened when you had to respond to an alert before you fully understood the system? What did you do?
Current state / outcome (30s): What does the system look like now? What documentation, runbooks, dashboards exist?
Quick Recap
- Before adding anything, map the blast radius: understand what breaks if this service fails.
- Audit what observable signals already exist before building new ones.
- Walk the critical paths manually as a failure exercise; it produces a runbook and identifies gaps simultaneously.
- Prioritize observability by: error rate > latency > business metric > resource saturation > detailed tracing.
- Communicate uncertainty to stakeholders early; premature confidence commitments are worse than honest uncertainty.