Diagnosing and explaining a cloud cost spike
The investigation framework for a sudden cloud bill increase: how to trace cost to cause, what to fix versus accept, and how to communicate the situation to non-technical stakeholders.
Why This Question Is Asked
"Tell me about a time you had an unexpected cost increase" tests whether engineers think about the economic dimension of systems, not just the technical one. At senior/staff level, you're expected to understand the relationship between system behavior and cloud spend, and to communicate it in terms leadership can act on.
The Investigation Sequence
Step 1: Get the scope immediately
Before any investigation:
- How much is the spike? (10%? 10x? One region or global?)
- When did it start? (Last night? Last week? After the quarter-end?)
- Which service or product is responsible? (Most billing tools show cost by tag or service)
- Is it still growing or has it plateaued?
A 15% increase over 60 days is a slow leak: important but not urgent. A 10x increase overnight is a potential runaway process or misconfiguration.
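The triage above can be sketched as a few lines of arithmetic. This is a minimal illustration with made-up thresholds and numbers, not a production rule:

```python
def classify_spike(baseline_daily: float, current_daily: float, days_elapsed: int) -> str:
    """Rough urgency triage for a cost change; thresholds are illustrative."""
    ratio = current_daily / baseline_daily
    if ratio >= 2.0:
        return "urgent: possible runaway process or misconfiguration"
    if ratio >= 1.1 and days_elapsed > 30:
        return "slow leak: important but not urgent"
    return "within normal variance"

# A 10x overnight jump vs. a 15% drift over 60 days:
print(classify_spike(1_000.0, 10_000.0, 1))   # urgent
print(classify_spike(1_000.0, 1_150.0, 60))   # slow leak
```

The point is only that magnitude and time window together determine urgency; the exact cutoffs will differ per organization.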
Step 2: Trace cost to root cause
The breakdown path depends on the cloud provider, but the categories are consistent:
Compute cost spike:
- New instances launched? (Auto-scaling event? Runaway ASG?)
- Existing instances running at higher utilization? (CPU-bound? Right-sized?)
- New workloads deployed? (Feature that's more resource-intensive?)
Data transfer / egress cost spike:
- Cross-region traffic? (New service calling another region?)
- CDN bypass? (Static assets being served from origin instead of edge?)
- External API calls? (New integration sending large payloads?)
Storage cost spike:
- Log volume increase? (New verbose logging accidentally in prod?)
- Database size increase? (Data growth expected? Unexpected rows?)
- Snapshot / backup accumulation? (Retention policy misconfigured?)
Managed service cost spike:
- Request volume change? (Traffic increase? Polling loop?)
- Feature tier change? (New feature triggering higher pricing tier?)
- Idle provisioned capacity? (DynamoDB on-demand vs. provisioned mismatch?)
Most billing tools (AWS Cost Explorer, GCP Cost Breakdown, Azure Cost Analysis) can filter by service, resource tag, and time. In most cases you should be able to attribute the spike to a specific service within 30 minutes.
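The same attribution can be done directly against a billing export. A minimal sketch, assuming a simplified CSV with `date,service,cost_usd` columns (real exports such as AWS Cost and Usage Reports have far more columns, but the grouping logic is the same):

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical two-day billing export for illustration.
billing_csv = """date,service,cost_usd
2024-05-01,EC2,400
2024-05-01,DataTransfer,50
2024-05-02,EC2,410
2024-05-02,DataTransfer,900
"""

def cost_delta_by_service(csv_text: str, before_date: str, after_date: str):
    """Compare per-service cost between two days to find where a spike landed."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for row in csv.DictReader(StringIO(csv_text)):
        if row["date"] == before_date:
            totals[row["service"]][0] += float(row["cost_usd"])
        elif row["date"] == after_date:
            totals[row["service"]][1] += float(row["cost_usd"])
    # Largest increase first: this is the service to investigate.
    return sorted(
        ((svc, after - before) for svc, (before, after) in totals.items()),
        key=lambda x: -x[1],
    )

print(cost_delta_by_service(billing_csv, "2024-05-01", "2024-05-02"))
# DataTransfer shows the largest day-over-day increase
```

Sorting by delta rather than absolute cost matters: the biggest line item is often not the one that changed.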
Step 3: Classify the cause
Category A: Runaway resource (immediate action required)
- Examples: a Lambda in an infinite loop, an ASG that won't scale down, a backup job without a deletion policy
- Fix: kill the runaway process immediately; address the root cause separately
Category B: Architecture smell (improvement opportunity)
- Examples: cross-region calls that should be regional, un-cached API calls, full scans on large tables
- Fix: engineer a solution with a proper timeline; quantify savings to prioritize
Category C: Expected growth (inform, don't fix)
- Example: traffic increased, so compute increased proportionally
- Action: forecast the growth trend, right-size instance types, present to leadership
Treating all three categories the same is a mistake. Category A needs emergency response. Category C might need nothing.
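The three categories map cleanly onto distinct responses, which can be captured as a small lookup. A sketch; the wording of each action is a paraphrase of the guidance above, not a standard taxonomy:

```python
# Playbook mirroring the three cause categories described above.
PLAYBOOK = {
    "runaway_resource": {
        "urgency": "immediate",
        "action": "kill the runaway process now; root-cause separately",
    },
    "architecture_smell": {
        "urgency": "scoped",
        "action": "quantify monthly savings, then prioritize the engineering fix",
    },
    "expected_growth": {
        "urgency": "none",
        "action": "forecast the trend, right-size, and inform leadership",
    },
}

def respond(category: str) -> str:
    entry = PLAYBOOK[category]
    return f"[{entry['urgency']}] {entry['action']}"

print(respond("runaway_resource"))
print(respond("expected_growth"))
```

Encoding the playbook explicitly is mostly useful as a forcing function: it prevents treating a Category C growth trend with Category A urgency.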
The Stakeholder Communication
Most cost spike conversations need to happen at two levels: technical (what happened and how to fix it) and leadership (what does this mean for the budget and timeline).
Technical audience
"The cost spike is in the data transfer line item. Our new recommendations
service is calling the product catalog service in us-east-1 from our
EU cluster. Each recommendation request triggers 3 catalog API calls,
and cross-region data transfer is billed per GB.
The fix is to either cache catalog data in the EU cluster or replicate
the catalog to EU. I estimate caching would reduce this cost by 90%.
The engineering work is about 1 sprint."
Leadership / finance audience
"We had a $15,000 cost increase on our cloud bill this month.
This is from the recommendations feature we launched last quarter.
It's working, but it's talking to a service in a different region and
that transfer is billed by the cloud provider.
We have a fix scoped that will reduce this by ~$13,000/month
and will take one sprint to implement."
The principles: quantify the problem, show a path to resolution, don't hide the root cause.
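The dollar figures in the examples above follow from simple arithmetic, which is worth showing your work on. A sketch using the numbers from the sample scripts; the 90% cache hit rate is the assumption the whole estimate rests on:

```python
monthly_spike = 15_000.0   # observed cross-region transfer cost from the bill
cache_hit_rate = 0.90      # assumed hit rate for catalog data cached in the EU cluster

# Every cache hit is a cross-region call (and its billed bytes) avoided.
savings = monthly_spike * cache_hit_rate
remaining = monthly_spike - savings
print(f"estimated savings: ${savings:,.0f}/month, remaining: ${remaining:,.0f}/month")
# ~$13,500/month saved, ~$1,500/month remaining
```

Stating the assumption (hit rate) alongside the estimate lets leadership judge the risk: if the cache only hits 70% of the time, the savings drop proportionally.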
The Cost Attribution Infrastructure
Strong answers include the observation that cost spikes are often discoverable only after the fact because cost attribution infrastructure is underdeveloped. The proactive version:
- Tags on all resources (service, environment, team)
- Weekly cost reports by team/service
- Alerting on anomalous cost changes (20%+ week-over-week)
- Budget alarms in the cloud provider
If your org doesn't have these, propose them as a follow-on from the incident. "We found this cost spike after the fact because we don't have cost anomaly alerting" is a legitimate finding.
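The week-over-week alerting rule above is simple enough to sketch directly. A minimal illustration with hypothetical weekly totals, not a production anomaly detector:

```python
def weekly_anomaly(costs_by_week, threshold=0.20):
    """Flag week-over-week cost increases above threshold (20% by default).

    costs_by_week: list of (label, total_cost) pairs in chronological order.
    """
    alerts = []
    for (prev_label, prev), (cur_label, cur) in zip(costs_by_week, costs_by_week[1:]):
        if prev > 0 and (cur - prev) / prev > threshold:
            alerts.append(f"{cur_label}: +{(cur - prev) / prev:.0%} vs {prev_label}")
    return alerts

print(weekly_anomaly([("W1", 10_000), ("W2", 10_500), ("W3", 14_000)]))
# W3 is +33% over W2 and triggers an alert; W2's +5% does not
```

Running this per team or per service tag, rather than on the total bill, is what makes the alert actionable: it names the owner along with the anomaly.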
Story Structure
Context (30s): What was the system, how large was the spike, how was it first noticed?
Investigation (1-2 min): Walk through the attribution sequence. What did you look at? What did you find?
Root cause (30s): Single clear sentence: "The root cause was X."
Fix (30s): What did you change? What was the impact?
Communication (30s): Who did you inform? What did you tell them?
Prevention (30s): What infrastructure or process was added to catch this early next time?
Quick Recap
- Scope the spike first (magnitude, timing, which service) before investigating the root cause.
- Trace cost through four categories: compute, data transfer, storage, and managed service usage.
- Classify the cause: runaway resource (immediate fix), architecture smell (scoped improvement), or expected growth (forecast and inform).
- Communicate at two levels: technical (what happened, how to fix) and leadership (dollar amount, timeline, path to resolution).
- Propose cost attribution infrastructure as a follow-on: cost anomaly alerting and resource tagging are the minimum viable cost hygiene.