SLIs, SLOs, and SLAs
How SLIs, SLOs, and SLAs work together: error budgets, burn rate alerting, the difference between an SLO target and an SLA contract, and how reliability goals drive engineering decisions.
TL;DR
- SLIs measure what users experience (availability, latency, error rate). SLOs set the internal target. SLAs are the external contract with financial penalties.
- Good SLIs use ratio format: good events / total events. Measure user experience, not infrastructure health. CPU at 80% is not an SLI.
- Error budgets convert reliability targets into engineering policy. A 99.9% SLO allows 43.2 minutes of downtime per 30 days. When the budget is healthy, ship fast. When it's depleted, freeze features.
- Burn rate alerting fires when you're consuming budget faster than sustainable, not just when a threshold is breached. Multi-window alerting (5min + 1hr) separates real incidents from noise.
- SLOs drive architecture decisions: if a dependency threatens your budget, that's the signal to add redundancy, caching, or circuit breakers.
The Problem It Solves
It's 3 a.m. and your on-call engineer gets paged. Grafana shows CPU at 92% on three app servers. She spends 40 minutes investigating, scales up the fleet, and goes back to sleep. The next morning, she learns that users were completely unaffected: the CPU spike was a background batch job, and all user-facing requests were fine.
The same week, a different incident happens. A silent database connection leak causes 2% of checkout requests to fail with 500 errors for six hours. No alert fires because CPU, memory, and disk all look healthy. The team only discovers it when a customer tweets about failed payments.
This is the core problem: infrastructure metrics don't correlate with user pain. Alert on CPU and you get woken up for things users never notice, while real user-facing failures pass silently because nothing measures them.
I've seen this pattern at almost every team that hasn't adopted SLO-based alerting. The dashboard has 200+ panels, PagerDuty fires 15 times a week, engineers are fatigued, and the actual user-impacting issues slip through because nobody set up the right signal. SLIs, SLOs, and SLAs replace this chaos with a structured framework that starts from what users experience and works backward to engineering decisions.
What Is It?
SLIs, SLOs, and SLAs are a three-layer framework (popularized by Google's SRE book) that aligns reliability measurement, reliability goals, and reliability contracts into a coherent engineering system. Each layer builds on the one below it.
SLI (Service Level Indicator): A specific, measurable ratio of user-facing quality. The raw signal. Format: good events / total events. Example: "99.2% of HTTP requests returned a non-5xx response in the last 7 days."
SLO (Service Level Objective): A target range for an SLI. The internal engineering goal. Example: "Availability SLI must be ≥ 99.9% over a rolling 30-day window." This is a commitment your team makes to itself.
SLA (Service Level Agreement): An external contract with customers that defines financial consequences when reliability drops below a threshold. Example: "If monthly uptime falls below 99.5%, customers receive a 10% service credit." SLAs are always more lenient than SLOs because the gap between them is your operating margin.
Think of it like a restaurant. The SLI is how long customers actually wait for their food (measured in minutes). The SLO is the kitchen's internal goal: serve every table within 15 minutes. The SLA is the promise on the menu: "If your food takes more than 30 minutes, it's free." The kitchen targets 15 minutes internally so they never come close to the 30-minute penalty.
The power of this framework is the error budget. Instead of arguing about whether a service is "reliable enough," teams can look at a number: how much of our error budget has been consumed this month? If there's budget remaining, ship features. If it's exhausted, focus on reliability. That converts a subjective argument into an objective one.
SLOs are not 100% uptime targets
The most common misconception: an SLO of 99.9% means "aim for 100%." It does not. If you consistently operate at 99.99% and your SLO is 99.9%, you are over-investing in reliability. That unspent error budget represents features you could have shipped, experiments you could have run, and migrations you could have attempted. The SLO defines the floor you commit to, not the ceiling you aspire to.
How It Works
Step 1: Design your SLIs
An SLI should measure something users actually care about. Bad SLIs measure infrastructure health (CPU, memory). Good SLIs measure user experience using a ratio: good events / total events.
| SLI Type | What It Measures | Example |
|---|---|---|
| Availability | Requests that succeed | (total - 5xx) / total ≥ 99.9% |
| Latency | Requests that are fast enough | (requests < 200ms) / total ≥ 99.5% |
| Error rate | Requests that don't fail | (total - errors) / total ≥ 99.95% |
| Correctness | Results that are accurate | correct_results / total_results ≥ 99.99% |
| Freshness | Data that is recent enough | fresh_reads / total_reads ≥ 99.9% |
The ratio format keeps every SLI between 0 and 1, making them comparable across services. My recommendation: start with availability and latency SLIs. Those two cover 80% of user-facing quality.
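To make the ratio format concrete, here's a minimal sketch in Python. The function names are illustrative, not from any SLO library:

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Availability SLI: fraction of requests that did not return a 5xx."""
    return (total_requests - server_errors) / total_requests

def latency_sli(fast_requests: int, total_requests: int) -> float:
    """Latency SLI: fraction of requests served under the latency threshold."""
    return fast_requests / total_requests

# 1,000,000 requests, 800 of them 5xx
print(availability_sli(1_000_000, 800))  # 0.9992

# 996,000 of 1,000,000 requests completed under 200ms
print(latency_sli(996_000, 1_000_000))   # 0.996
```

Both values land between 0 and 1, which is what makes SLIs comparable across services regardless of traffic volume.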
Step 2: Set your SLOs
Pick a target for each SLI and a measurement window. Common windows are 7-day (rolling) and 30-day (calendar month).
The tricky part is choosing the right number. Too aggressive (99.99%) and you'll spend all your engineering time on reliability. Too lenient (99%) and users will leave before you hit the threshold. I'll often start teams at 99.9% for availability and 99.5% for p99 latency as initial targets, then adjust based on actual data.
Step 3: Calculate error budgets
The error budget is the allowed amount of unreliability derived from the SLO:
```text
SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Error budget: 0.1% × 43,200 = 43.2 minutes of downtime allowed

SLO: p99 latency < 200ms · 99.5% compliance over 7 days
Error budget: 0.5% of requests may exceed 200ms
At 1,000 req/s over 7 days = 604,800,000 requests
Allowed slow requests: 0.5% × 604,800,000 = 3,024,000
```
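The same arithmetic in code, as a sketch (function names are mine, not a standard API):

```python
def downtime_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

def slow_request_budget(slo: float, req_per_sec: float, window_days: int) -> int:
    """Allowed number of too-slow requests for a latency-compliance SLO."""
    total_requests = req_per_sec * window_days * 24 * 3600
    return int((1 - slo) * total_requests)

print(downtime_budget_minutes(0.999, 30))   # ~43.2 minutes
print(slow_request_budget(0.995, 1000, 7))  # 3,024,000 requests
```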
Step 4: Set up burn rate alerting
Simple threshold alerts ("error rate > 0.1%") fire constantly on noise. Burn rate alerting fires when you're consuming budget faster than sustainable:
A burn rate of 14x means you're consuming your 30-day budget 14 times faster than sustainable, so it would run out in roughly 2 days. Requiring both a short window (5 minutes) and a longer window (1 hour) to exceed the threshold filters out brief spikes that would be noise in single-window alerting.
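The burn rate math above can be sketched as follows (names are illustrative; in practice this logic lives in your alerting rules, not application code):

```python
def burn_rate(sli: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being consumed.

    1x means the budget lasts exactly one window; 14x means it's gone in
    roughly 1/14 of the window (~2 days for a 30-day window).
    """
    return (1 - sli) / (1 - slo)

def should_page(sli_5m: float, sli_1h: float, slo: float,
                threshold: float = 14.0) -> bool:
    """Page only if BOTH the short and long windows exceed the threshold."""
    return burn_rate(sli_5m, slo) > threshold and burn_rate(sli_1h, slo) > threshold

# 1.5% errors sustained across both windows against a 99.9% SLO -> ~15x burn
print(should_page(0.985, 0.985, 0.999))   # True
# A brief spike: 5m window bad, 1h window healthy -> no page
print(should_page(0.985, 0.9995, 0.999))  # False
```

The second call is the whole point of multi-window alerting: a spike that looks alarming over five minutes but vanishes in the hourly view never wakes anyone up.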
Step 5: Implement with Prometheus
Here's a concrete implementation. The recording rules compute the availability SLI at each window the alerts need (including the 6h window used by the slow-burn alert), and the alerting rules fire on high burn rates:

```yaml
# Prometheus recording rules: compute the availability SLI per window
groups:
  - name: slo_recording_rules
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )
      - record: sli:availability:ratio_rate1h
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )
      # 6h window: referenced by the slow-burn alert below
      - record: sli:availability:ratio_rate6h
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )

  - name: slo_burn_rate_alerts
    rules:
      # Fast burn: page if 5m AND 1h both show 14x consumption
      - alert: SLOBurnRateCritical
        expr: |
          (1 - sli:availability:ratio_rate5m) / (1 - 0.999) > 14
          and
          (1 - sli:availability:ratio_rate1h) / (1 - 0.999) > 14
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Availability SLO burning at >14x rate"
      # Slow burn: ticket if 1h AND 6h both show 6x consumption
      - alert: SLOBurnRateSlow
        expr: |
          (1 - sli:availability:ratio_rate1h) / (1 - 0.999) > 6
          and
          (1 - sli:availability:ratio_rate6h) / (1 - 0.999) > 6
        for: 5m
        labels:
          severity: ticket
        annotations:
          summary: "Availability SLO burning at >6x rate"
```
Step 6: Create an error budget policy
The error budget policy defines what happens at different budget levels:
| Budget Remaining | Engineering Response |
|---|---|
| > 50% | Ship freely, run experiments, attempt risky migrations |
| 20-50% | Increase canary duration, slow non-essential changes, review recent regressions |
| 5-20% | Freeze non-critical features, prioritize reliability work, post-mortem recent incidents |
| 0% (exhausted) | Full feature freeze until budget recovers, all hands on reliability |
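A policy like this can be encoded directly so the response is looked up, not debated. The thresholds below mirror the table above and are examples, not industry standards:

```python
def budget_policy(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to the engineering response."""
    if budget_remaining > 0.50:
        return "ship freely"
    if budget_remaining > 0.20:
        return "slow non-essential changes"
    if budget_remaining > 0.0:
        return "freeze non-critical features"
    return "full feature freeze"

print(budget_policy(0.60))  # ship freely
print(budget_policy(0.0))   # full feature freeze
```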
For your interview: describe this as "the error budget is the arbiter between product velocity and reliability, and the policy converts a subjective 'how reliable is reliable enough?' into an objective, data-driven process."
Key Components
| Component | Role |
|---|---|
| SLI | Raw measurement of user-facing quality. Ratio format: good events / total events. |
| SLO | Internal target for an SLI over a time window (e.g., ≥ 99.9% over 30 days). |
| SLA | External contract with penalties (credits, refunds) when reliability drops below threshold. |
| Error Budget | Allowed unreliability derived from SLO: (1 - SLO) × window. The currency of risk. |
| Burn Rate | Speed of error budget consumption. 1x = sustainable, 14x = budget exhausts in ~2 days. |
| Error Budget Policy | Rules for engineering response at different budget levels (ship/slow/freeze). |
| SLO Dashboard | Real-time visibility into budget consumption per service and SLI. |
| SLO Review | Monthly or quarterly ceremony to evaluate and adjust SLO targets based on data. |
Types / Variations
Not all SLIs are created equal. The right SLI depends on what your users care about most.
| SLI Type | Formula | Good Window | When to Use |
|---|---|---|---|
| Availability | (total - 5xx) / total | 30 days | APIs, web services: the "are we up?" signal |
| Latency | (requests < threshold) / total | 7 days | User-facing pages, real-time APIs |
| Throughput | served / attempted | 7 days | Batch processing, data pipelines |
| Error rate | (total - errors) / total | 30 days | Payment APIs, critical transactions |
| Correctness | correct / total | 30 days | Financial calculations, search relevance |
| Freshness | fresh_reads / total_reads | 7 days | Dashboards, analytics, cached data |
| Durability | recoverable / stored | Quarterly | Storage systems, backup services |
I'll often see teams start with just availability and call it done. That's a mistake because a service can be "up" (returning 200s) while serving stale data or responding in 5 seconds. My recommendation: every user-facing service needs at least availability + latency SLIs. Payment and data services need correctness too.
SLO Tiers: Internal vs. External
| Aspect | SLO (Internal) | SLA (External) |
|---|---|---|
| Audience | Engineering team | Customers, contracts |
| Target | Aggressive: 99.9% | Conservative: 99.5% |
| Consequence of violation | Error budget policy kicks in | Financial penalties, credits |
| Flexibility | Adjusted quarterly based on data | Locked into contract terms |
| Gap between them | The gap is your operating margin | Must always be more lenient |
The gap matters. If your SLO is 99.9% and your SLA is 99.5%, you have a 0.4% buffer (about 2.9 hours per month). That buffer is where you catch problems before they become contractual violations. Never set your SLA equal to your SLO.
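The buffer arithmetic, as a quick sketch (names are illustrative):

```python
def sla_buffer_minutes(slo: float, sla: float, window_days: int = 30) -> float:
    """Operating margin between the internal SLO and the external SLA."""
    assert sla < slo, "the SLA must always be more lenient than the SLO"
    return (slo - sla) * window_days * 24 * 60

# 99.9% SLO vs. 99.5% SLA over 30 days -> ~172.8 minutes (~2.9 hours)
print(sla_buffer_minutes(0.999, 0.995))
```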
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Alerts correlate with actual user impact | Requires instrumentation at the request level, not just infrastructure |
| Error budgets give product and reliability a shared language | Cultural shift is hard: teams must accept that 100% is not the goal |
| Burn rate alerting dramatically reduces false-positive pages | Initial setup cost is significant (recording rules, dashboards, policies) |
| SLOs drive objective architecture decisions | Poorly chosen SLIs can create a false sense of security |
| Error budget policy removes subjective reliability arguments | Teams may game SLIs (excluding error classes, narrowing scope) |
| Monthly review cycle forces continuous improvement | Requires organizational buy-in, not just tooling |
The fundamental tension is user-signal fidelity vs. operational investment. SLO-based reliability engineering gives you far better signal about when users are actually hurting, but it requires meaningful upfront investment in instrumentation, cultural change, and ongoing SLO review discipline. The alternative (alerting on CPU and hoping for the best) is cheaper to set up and worse at everything else.
When to Use It / When to Avoid It
Use SLOs when:
- Any service with real users: If humans depend on your service (internal or external), SLOs should define what "working" means. This includes internal tools that engineers use daily.
- Multi-team organizations: SLOs give teams a shared vocabulary for reliability. Instead of "the API feels slow," you get "the latency SLI dropped below target for 3 hours."
- Microservice architectures: Each service needs its own SLOs. When service A depends on service B, A's SLO depends on B's SLO. This makes dependency chains explicit.
- You need to prioritize reliability investment: Error budgets tell you exactly where to spend engineering time. If service X is burning budget fast, that's where your next sprint goes.
- On-call is painful: If your on-call engineers are drowning in false positives, SLO-based burn rate alerting will likely cut your page volume by 50-80%.
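The dependency-chain point above is worth quantifying. Assuming independent failures (a simplification, but a useful first-order model), hard dependencies in series multiply availabilities, while redundant replicas fail only if all replicas fail at once:

```python
def serial_availability(*availabilities: float) -> float:
    """A service with hard dependencies in series: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant_availability(a: float, replicas: int) -> float:
    """Independent redundant replicas: the system fails only if all fail."""
    return 1 - (1 - a) ** replicas

# A 99.95% service calling two 99.9% dependencies lands near 99.75%,
# below a 99.9% SLO -- the signal to add redundancy or caching:
print(serial_availability(0.9995, 0.999, 0.999))  # ~0.9975

# Two independent 99.9% replicas: ~99.9999%
print(redundant_availability(0.999, 2))
```

This is the mechanical reason a tight SLO on a long dependency chain pushes you toward redundancy, caching, or circuit breakers.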
Avoid (or defer) SLOs when:
- Prototype or MVP stage: You don't have enough traffic or stability to set meaningful targets. Ship first, then define SLOs once you have 2-4 weeks of production data.
- Internal tool with 3 users: The overhead of SLI instrumentation, dashboards, and error budget policies isn't worth it for a tool that three engineers use occasionally.
- Pure batch processing: Traditional SLOs measure request-level quality. Batch jobs need different SLIs (completion rate, freshness, throughput) that don't map cleanly to the standard ratio format.
- You don't have observability infrastructure: SLOs require request-level metrics (not just server-level). If you can't measure good_requests / total_requests, instrument first.
The honest answer: if you have more than one team, more than one service, and real users, you need SLOs. Start with your most critical user journey and expand from there.
Real-World Examples
Google pioneered SLO-driven culture through the SRE book (2016). Every Google service has explicit SLOs, and error budgets drive engineering prioritization. When a service exhausts its budget, a formal error budget policy kicks in: feature freezes, mandatory postmortems, and reliability-focused sprints. Google's internal SLO tooling tracks hundreds of SLIs across thousands of services.
Slack runs SLO dashboards for every production service. Their public status page shows real-time SLI data, and internal teams track burn rates daily. Slack's approach to error budgets is particularly strict: when a critical service (messaging, calls) burns 50% of its monthly budget in one incident, the owning team pauses feature work for the remainder of the month. This policy has measurably reduced repeat incidents.
Stripe publishes a 99.999% availability SLA for their core payment API, one of the most aggressive in the industry. Internally, their SLOs are even tighter. Stripe's approach to SLIs is notable: they measure availability from the customer's perspective (not server-side metrics), using synthetic monitoring and real-user measurement to compute SLIs that reflect actual payment success rates. At their scale (billions of API requests per year), even a 99.99% SLO allows only ~52 minutes of downtime annually.
How This Shows Up in Interviews
When to bring it up
Mention SLOs when the interviewer asks about monitoring, alerting, or reliability. If you're designing a system and discussing non-functional requirements, frame reliability in SLO terms: "I'd set a 99.9% availability SLO and a p99 latency SLO under 200ms." This signals that you think about reliability as a measurable, managed property rather than just "make it not crash."
Also bring up error budgets when discussing trade-offs between feature velocity and reliability. "We'd use error budgets to decide whether we can afford this risky migration" shows mature engineering judgment.
Depth expected at senior/staff level
- SLI design: Ratio format, measuring user experience (not infra), choosing the right SLI types for the service.
- Error budget math: Calculate budget from SLO + window, explain what budget levels mean for engineering behavior.
- Burn rate alerting: Multi-window approach, why single-threshold alerts are noisy, what burn rate numbers (14x, 6x) mean.
- Error budget policy: What happens when budget is exhausted. Feature freeze, postmortem requirements, reliability focus.
- SLO vs. SLA gap: Why the SLA must be more lenient, what the gap represents (operating margin).
- SLO-driven architecture: How SLOs influence decisions about redundancy, caching, circuit breakers, and dependency management.
Interview move: anchor your NFRs in SLO language
Instead of saying "the system should be highly available," say "I'd target a 99.9% availability SLO, which gives us 43 minutes of downtime per month. That means we need at least N+1 redundancy and automated failover that completes in under 30 seconds." Specific numbers, specific implications. Interviewers love this.
Follow-up Q&A
| Interviewer asks | Strong answer |
|---|---|
| "How do you decide the right SLO target?" | "Start with what users tolerate. Look at current performance data for a baseline, then set the SLO slightly below your current measured performance. If you're actually at 99.95%, set the SLO at 99.9%. Adjust quarterly based on user feedback and budget consumption." |
| "What's the difference between SLO and SLA?" | "SLOs are internal targets that drive engineering behavior. SLAs are external contracts with financial consequences. The SLA is always more lenient than the SLO, and the gap between them is your operating margin. Violating an SLO is a signal to act; violating an SLA costs you money." |
| "How do error budgets prevent over-reliability?" | "If your SLO is 99.9% but you're consistently at 99.99%, you have unspent budget. That means you could be shipping faster, running more experiments, or doing riskier migrations. The budget gives product teams ammunition to push for velocity when reliability is healthy." |
| "What happens when two teams disagree about reliability priority?" | "The error budget settles it. If team A's service is at 60% budget remaining and team B's is at 5%, team B's reliability work gets priority. No subjective arguments needed." |
| "Why not just alert on error rate?" | "A flat error rate threshold doesn't account for how fast you're consuming your monthly budget. A 0.5% error spike for 2 minutes is noise. A 0.3% sustained error rate for 6 hours quietly eats your entire budget. Burn rate alerting catches both: fast burns page immediately, slow burns create tickets." |
Test Your Understanding
Quick Recap
- SLIs measure what users experience (availability, latency, error rate) using ratio format: good events / total events. They replace infrastructure metrics as the primary reliability signal.
- SLOs set internal targets for SLIs. SLAs are external contracts with financial penalties. The SLA must always be more lenient than the SLO, and the gap is your operating margin.
- Error budgets convert SLOs into engineering policy. A 99.9% SLO over 30 days gives you 43 minutes of allowed downtime. Budget levels dictate engineering behavior: ship freely, slow down, or freeze features.
- Burn rate alerting fires when budget is being consumed faster than sustainable. Multi-window alerting (5min + 1hr for fast burns, 1hr + 6hr for slow burns) dramatically reduces false-positive pages compared to simple threshold alerts.
- SLOs drive architecture decisions. If a dependency threatens your error budget, that's the objective signal to add redundancy, implement circuit breakers, or introduce caching. The budget quantifies the cost of unreliability.
- Operating at 100% reliability is a waste. If your error budget is never spent, your SLO is too conservative or you're over-investing in reliability at the expense of feature velocity. The budget exists to be spent.
Related Concepts
- Observability: SLIs are computed from observability data (metrics, logs, traces). You need request-level instrumentation before you can define meaningful SLIs.
- Distributed Tracing: When a burn rate alert fires, distributed tracing is how you find root cause across services. SLOs tell you that something is wrong; traces tell you where.
- Circuit Breaker: Circuit breakers protect your error budget by preventing cascade failures. When a dependency's SLO drops, the circuit breaker opens to preserve your own SLO.