SLIs, SLOs, and SLAs
How SLIs, SLOs, and SLAs work together with error budgets, burn rate alerting, the difference between an SLO target and an SLA contract, and how reliability goals drive engineering decisions.
TL;DR
- SLIs measure what users experience (availability, latency, error rate). SLOs set the internal target. SLAs are the external contract with financial penalties.
- Good SLIs use ratio format: good events / total events. Measure user experience, not infrastructure health. CPU at 80% is not an SLI.
- Error budgets convert reliability targets into engineering policy. A 99.9% SLO = 43 minutes of allowed downtime per 30 days. When budget is healthy, ship fast. When depleted, freeze features.
- Burn rate alerting fires when you're consuming budget faster than sustainable, not just when a threshold is breached. Multi-window alerting (5min + 1hr) separates real incidents from noise.
- SLOs drive architecture decisions: if a dependency threatens your budget, that's the signal to add redundancy, caching, or circuit breakers.
The Problem It Solves
It's 3 a.m. and your on-call engineer gets paged. Grafana shows CPU at 92% on three app servers. She spends 40 minutes investigating, scales up the fleet, and goes back to sleep. The next morning, she learns that users were completely unaffected: the CPU spike was a background batch job, and all user-facing requests were fine.
The same week, a different incident happens. A silent database connection leak causes 2% of checkout requests to fail with 500 errors for six hours. No alert fires because CPU, memory, and disk all look healthy. The team only discovers it when a customer tweets about failed payments.
This is the core problem: infrastructure metrics don't correlate with user pain. Alert on CPU and you get woken up for things users never notice, while the failures users do notice go undetected because no one set up the right signal.
I've seen this pattern at almost every team that hasn't adopted SLO-based alerting. The dashboard has 200+ panels, PagerDuty fires 15 times a week, engineers are fatigued, and the issues that actually affect users still slip through. SLIs, SLOs, and SLAs replace this chaos with a structured framework that starts from what users experience and works backward to engineering decisions.
What Is It?
SLIs, SLOs, and SLAs are a three-layer framework (popularized by Google's SRE book) that aligns reliability measurement, reliability goals, and reliability contracts into a coherent engineering system. Each layer builds on the one below it.
SLI (Service Level Indicator): A specific, measurable ratio of user-facing quality. The raw signal. Format: good events / total events. Example: "99.2% of HTTP requests returned a non-5xx response in the last 7 days."
SLO (Service Level Objective): A target range for an SLI. The internal engineering goal. Example: "Availability SLI must be ≥ 99.9% over a rolling 30-day window." This is a commitment your team makes to itself.
SLA (Service Level Agreement): An external contract with customers that defines financial consequences when reliability drops below a threshold. Example: "If monthly uptime falls below 99.5%, customers receive a 10% service credit." An SLA should always be more lenient than the matching SLO: the gap between them is your operating margin, so missing the internal target is a warning rather than a payout.
Think of it like a restaurant. The SLI is how long customers actually wait for their food (measured in minutes). The SLO is the kitchen's internal goal: serve every table within 15 minutes. The SLA is the promise on the menu: "If your food takes more than 30 minutes, it's free." The kitchen targets 15 minutes internally so they never come close to the 30-minute penalty.
The power of this framework is the error budget. Instead of arguing about whether a service is "reliable enough," teams can look at a number: how much of our error budget has been consumed this month? If there's budget remaining, ship features. If it's exhausted, focus on reliability. That converts a subjective argument into an objective one.
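As a rough illustration of that decision, the sketch below turns event counts into a remaining-budget fraction. The function names, the sample counts in the usage note, and the simple ship/freeze rule are assumptions made for the example, not a prescribed policy.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good        # failures observed so far
    return 1 - actual_bad / allowed_bad


def release_policy(remaining: float) -> str:
    """Hypothetical policy: ship while any budget is left, otherwise stop."""
    return "ship features" if remaining > 0 else "focus on reliability"
```

With a 99.9% SLO, 1,000,000 requests, and 500 failures, about half the budget is left, so this policy says to keep shipping. At 2,000 failures the budget is overspent and the policy flips.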
SLOs are not 100% uptime targets
The most common misconception: an SLO of 99.9% means "aim for 100%." It does not. If you consistently operate at 99.99% and your SLO is 99.9%, you are over-investing in reliability. That unspent error budget represents features you could have shipped, experiments you could have run, and migrations you could have attempted. The SLO defines the floor you commit to, not the ceiling you aspire to.
How It Works
Step 1: Design your SLIs
An SLI should measure something users actually care about. Bad SLIs measure infrastructure health (CPU, memory). Good SLIs measure user experience using a ratio: good events / total events.
| SLI Type | What It Measures | Example |
|---|---|---|
| Availability | Requests that succeed | (total - 5xx) / total ≥ 99.9% |
| Latency | Requests that are fast enough | (requests < 200ms) / total ≥ 99.5% |
| Error rate | Requests that don't fail | (total - errors) / total ≥ 99.95% |
| Correctness | Results that are accurate | correct_results / total_results ≥ 99.99% |
| Freshness | Data that is recent enough | fresh_reads / total_reads ≥ 99.9% |
The ratio format keeps every SLI between 0 and 1, making them comparable across services. My recommendation: start with availability and latency SLIs. Those two cover the bulk of what users actually notice.
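A minimal sketch of the ratio format for the first two rows of the table, assuming each request is available as a record with a status code and a latency. The `Request` type, the function names, and the 200 ms default are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # how long the user waited for the response


def ratio_sli(good: int, total: int) -> Optional[float]:
    """good events / total events, or None when there was no traffic."""
    return good / total if total else None


def availability_sli(requests: Iterable[Request]) -> Optional[float]:
    reqs = list(requests)
    good = sum(1 for r in reqs if r.status < 500)  # any non-5xx counts as good
    return ratio_sli(good, len(reqs))


def latency_sli(requests: Iterable[Request],
                threshold_ms: float = 200.0) -> Optional[float]:
    reqs = list(requests)
    good = sum(1 for r in reqs if r.latency_ms < threshold_ms)
    return ratio_sli(good, len(reqs))
```

Returning `None` for an empty window is a design choice in this sketch: it keeps "no traffic" distinct from both 0% and 100%.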
Step 2: Set your SLOs
Pick a target for each SLI and a measurement window. Common windows are 7 days (rolling) and 30 days (rolling, or aligned to a calendar month).
The tricky part is choosing the right number. Too aggressive (99.99%) and you'll spend all your engineering time on reliability. Too lenient (99%) and users will leave before you hit the threshold. I often start teams at 99.9% for availability and 99.5% for latency (that is, 99.5% of requests under the latency threshold) as initial targets, then adjust based on actual data.
Step 3: Calculate error budgets
The error budget is the allowed amount of unreliability derived from the SLO:
SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Error budget: 0.1% × 43,200 = 43.2 minutes of downtime allowed
SLO: p99 latency < 200ms · 99.5% compliance over 7 days
Error budget: 0.5% of requests may exceed 200ms
At 1,000 req/s over 7 days = 604,800,000 requests
Allowed slow requests: 3,024,000
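The same arithmetic as a small sketch, so the numbers can be recomputed for other targets and windows. The function names are assumptions, and the request budget assumes a constant request rate over the window.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full downtime the SLO allows over the window."""
    return (1 - slo) * window_days * MINUTES_PER_DAY


def request_budget(slo: float, requests_per_second: float, window_days: int) -> int:
    """Number of bad (failed or slow) requests the SLO allows over the window."""
    total_requests = requests_per_second * window_days * SECONDS_PER_DAY
    return round((1 - slo) * total_requests)  # round away float noise
```

`downtime_budget_minutes(0.999, 30)` gives roughly 43.2 and `request_budget(0.995, 1000, 7)` gives 3,024,000, matching the figures above.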
Step 4: Set up burn rate alerting
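Simple threshold alerts ("error rate > 0.1%") fire constantly on noise. Burn rate alerting fires only when you're consuming budget faster than is sustainable. The burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1x spends the budget exactly by the end of the window.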
A burn rate of 14x means you're consuming your 30-day budget 14 times faster than sustainable, so it would run out in roughly 2 days. Requiring both a short window (5 minutes) and a longer window (1 hour) to exceed the threshold filters out brief spikes that would be noise in single-window alerting.
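A minimal sketch of that logic, assuming the error ratios for the 5-minute and 1-hour windows are already measured elsewhere. The function names are hypothetical, and the 14x default simply reuses the figure from the text.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget runs out if the current burn rate holds."""
    if rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days / rate


def should_page(short_window_error_ratio: float,
                long_window_error_ratio: float,
                slo: float,
                threshold: float = 14.0) -> bool:
    """Page only when BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(short_window_error_ratio, slo) >= threshold
            and burn_rate(long_window_error_ratio, slo) >= threshold)
```

A brief spike (5% errors over 5 minutes but 0.05% over the hour) does not page, because the long window stays under the threshold. A sustained 2% error rate in both windows does.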
Step 5: Implement with Prometheus
Here's a concrete implementation. The recording rule computes the availability SLI, and the alerting rule fires on high burn rates:
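One possible shape for those expressions, sketched as Python that renders PromQL strings. The metric name `http_requests_total`, its `code` label, the helper names, and the default 14x factor with 5m and 1h windows are all assumptions for illustration. In a real setup the resulting strings would go into the `expr` fields of a Prometheus rule file.

```python
def availability_sli_expr(window: str, metric: str = "http_requests_total") -> str:
    """PromQL for the availability SLI: non-5xx request rate / total request rate."""
    good = f'sum(rate({metric}{{code!~"5.."}}[{window}]))'
    total = f"sum(rate({metric}[{window}]))"
    return f"{good} / {total}"


def burn_rate_alert_expr(slo: float, factor: float = 14.0,
                         short: str = "5m", long: str = "1h") -> str:
    """PromQL that is true only when both windows burn budget at >= factor."""
    threshold = factor * (1 - slo)  # error ratio equal to `factor` x burn rate
    short_err = f"(1 - ({availability_sli_expr(short)}))"
    long_err = f"(1 - ({availability_sli_expr(long)}))"
    return f"{short_err} > {threshold:g} and {long_err} > {threshold:g}"
```

For a 99.9% SLO, `burn_rate_alert_expr(0.999)` compares the error ratio in each window against 0.014, which is 14 times the 0.1% budget.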