How to Actually Set Meaningful SLOs (Most Teams Are Doing It Wrong)



Let's start with a confession most engineering teams won't make out loud: the majority of SLOs in production today are theater.
Here's the scene. A manager reads a Google SRE blog post on a Sunday evening. Monday morning, they ask the team to "implement SLOs by end of sprint." By Friday, someone has wired up a Prometheus alert on CPU uptime and called it availability: 99.9%. The metric goes on the dashboard. The dashboard goes on the TV in the office. The TV never turns red.
Meanwhile, users are seeing 10-second load times, checkout flows that silently fail, and API errors that return HTTP 200 with a JSON body that says "status": "error". The infrastructure is healthy. The users are not.
This is the central dysfunction: teams confuse infrastructure health with user experience. CPU uptime, memory usage, pod restarts: these are signals that something might be wrong with your systems. They are not signals that something is definitely wrong for your users. A server can have 15% CPU utilization and still be returning 500s to every request.
Real SLOs are built from the user's perspective. They answer one question: "Is this product doing what the user needs it to do, fast enough and reliably enough to keep their trust?"
Everything else is just a health check with a percentage attached.
Before fixing the anti-patterns, the vocabulary needs to be tight. These three terms are routinely used interchangeably, which is how you end up tracking disk I/O and calling it a reliability target.
An SLI is a quantitative measurement of a specific dimension of service quality, expressed as a ratio of good events to total events.
SLI = (good events / total events) × 100
"Good event" needs a crisp definition. For a REST API, a good event might be: a request that returned a 2xx status code in under 500ms. Every request that timed out, returned a 5xx, or took longer than 500ms is a bad event.
The key word is user-observable. If the user cannot feel it, it probably should not be your primary SLI.
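The good/bad classification above reduces to a few lines of code. The sketch below assumes each request is summarized by an HTTP status and a latency; the 500ms threshold follows the REST API example in the text.

```python
# SLI = (good events / total events) x 100.
# A "good" event here: 2xx status AND latency under 500 ms (the example thresholds).

def is_good(status: int, latency_ms: float) -> bool:
    return 200 <= status < 300 and latency_ms < 500

def sli(requests: list[tuple[int, float]]) -> float:
    good = sum(1 for status, latency in requests if is_good(status, latency))
    return 100.0 * good / len(requests)

requests = [
    (200, 120.0),  # good
    (200, 850.0),  # too slow -> bad
    (503, 90.0),   # server error -> bad
    (201, 430.0),  # good
]
print(sli(requests))  # 50.0
```

Note that a timeout never produces a status code at all; in practice you would count it as a bad event with an infinite latency.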
An SLO is a target value for your SLI over a defined time window.
SLO = SLI target over a rolling window
Example: "99.5% of order placement requests return 2xx in under 500ms, measured over a rolling 30-day window."
The time window matters enormously. A 30-day rolling window is far more useful than a calendar-month window because it gives you a continuous, current picture of reliability, not one that resets on the 1st of the month and gives engineers a false sense of a clean slate.
The error budget is the allowable amount of unreliability derived from your SLO. It's what you get to spend on deployments, experiments, and incidents before you've violated your reliability promise.
Error Budget = 1 - SLO target
If your SLO is 99.5%, your error budget is 0.5%. Over a 30-day window (43,200 minutes), that translates to ~216 minutes of allowable downtime, roughly 3 hours and 36 minutes.
The error budget is the forcing function that makes SLOs change behavior. When the budget is healthy, you ship fast and take risks. When it's draining, you slow down and focus on reliability. This is the tradeoff SLOs formalize (reliability vs. velocity), and without the error budget, you have no mechanism to enforce it.
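The budget arithmetic above is worth keeping as a one-liner, since every incident review ends up redoing it by hand. A minimal sketch:

```python
# Error budget in minutes: window_minutes x (1 - SLO target).
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)

print(round(error_budget_minutes(0.995), 1))  # 216.0 -> ~3h 36m per 30 days
print(round(error_budget_minutes(0.999), 1))  # 43.2  -> under 45 minutes
```

The second line shows why tightening a target from 99.5% to 99.9% is a big deal: the budget shrinks by a factor of five.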
What it looks like:
# This is not an SLO
- name: "CPU Availability"
  metric: avg(cpu_usage_idle) > 0.1
  target: 99.9%
CPU idle percentage, memory utilization, disk I/O: these are leading indicators of potential problems, not measures of user experience. A system can have perfect infrastructure vitals and be silently failing every user request.
What to do instead:
Track HTTP success rate and latency at the application layer, from the user's perspective.
If your SLI cannot be directly felt by a user making a request, it belongs in a runbook, not in an SLO.
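Measuring at the application layer also lets you catch the failure mode from the opening example: an HTTP 200 whose JSON body says `"status": "error"`. A sketch of user-perspective classification, assuming the response body carries a `status` field (the field name is illustrative):

```python
import json

# A good event from the user's perspective: 2xx transport status, fast enough,
# AND no application-level error hiding inside a "successful" response.
def is_good_event(http_status: int, latency_ms: float, body: str) -> bool:
    if not (200 <= http_status < 300):
        return False
    if latency_ms >= 500:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return True  # non-JSON body: judge by status and latency alone
    return payload.get("status") != "error"

print(is_good_event(200, 120, '{"status": "ok"}'))     # True
print(is_good_event(200, 120, '{"status": "error"}'))  # False: healthy infra, failed user
```

An infrastructure-level probe would count both of these requests as successes.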
What it looks like: A team sets 99.99% availability on their first SLO because it "sounds right" for a production system.
99.99% allows only 4.38 minutes of downtime per month. Unless you have fully automated rollback, zero-downtime deployments, multi-region failover, and a battle-hardened incident process, you will blow this budget with a single botched deploy. And once the budget is gone, the SLO becomes a number everyone quietly ignores.
Tight SLOs that are routinely missed train your team to dismiss SLOs entirely.
What to do instead:
Start by measuring your current reliability. Run your SLI query over the past 30 days and see what you're actually achieving. If you're at 98.7%, setting an SLO of 99.9% is aspirational fantasy. Set it at 98.4%, just under current performance, to create accountability without setting the team up to fail.
The SLO is not a marketing promise. It is an engineering contract. Set it at a level your team can defend, then tighten it as you improve.
Rule of thumb: Start with current_baseline - 0.3% as your initial SLO, and revisit quarterly.
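The rule of thumb is trivial to encode, which makes it easy to apply consistently across services during the quarterly revisit:

```python
# Rule of thumb from the text: initial SLO = measured 30-day baseline minus 0.3 points.
def initial_slo(baseline_pct: float, margin_pct: float = 0.3) -> float:
    return round(baseline_pct - margin_pct, 2)

print(initial_slo(98.7))   # 98.4
print(initial_slo(99.95))  # 99.65
```

The margin is a starting point, not a law; a service with highly variable traffic may want a wider one.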
What it looks like: Teams alert on the SLO breach itself: a PagerDuty notification only fires when reliability drops below the target. By the time that fires, the error budget may already be 80% consumed.
What to do instead:
Alert on how fast you're consuming the error budget, not on whether you've already breached the final target. A burn rate of 1.0 means you're on pace to exhaust the budget exactly at month-end. A burn rate of 14.0 means you'll exhaust it in roughly 2 days.
This gives you early warning and proportional urgency: a page for fast burns (outages), a Slack alert for slow burns (quiet regressions). The full multi-window mechanics are in Section 5.
What it looks like:
payment-service: availability 99.9%
cart-service: availability 99.9%
inventory-service: availability 99.9%
user-service: availability 99.9%
This looks comprehensive. It is not. A user completing a checkout flow touches all four services. If each has 99.9% availability independently, the joint availability of the checkout flow is:
0.999 × 0.999 × 0.999 × 0.999 ≈ 99.6%
You have four green dashboards and a user journey that's failing 1 in 250 times, and no single service owner sees it as their problem.
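The compounding is easy to verify: for services in series, per-service availabilities multiply, so the journey is always less reliable than its weakest link suggests.

```python
# Joint availability of a serial user journey: per-service availabilities multiply.
def journey_availability(service_availabilities: list[float]) -> float:
    result = 1.0
    for a in service_availabilities:
        result *= a
    return result

joint = journey_availability([0.999] * 4)
print(round(joint, 5))          # 0.99601 -> ~99.6% for the four-service checkout
print(round(1 / (1 - joint)))   # 250 -> fails roughly 1 request in 250
```

Add a fifth 99.9% service to the flow and the journey drops to ~99.5%, silently consuming the entire error budget of a 99.5% SLO.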
What to do instead:
Define SLOs for user journeys, not service instances. The checkout flow is one SLO. The authentication flow is one SLO. Search-and-discovery is one SLO.
Map what users do in your system first. Then instrument it. The service topology is an implementation detail; the user journey is the contract.
What it looks like: The error budget is calculated once, put on a dashboard, and never discussed again. Teams ship features. Incidents happen. The budget drains. Nobody changes behavior because nobody is looking.
What to do instead:
Run a monthly SLO review: a 30-minute ritual with a fixed agenda.
Without this ritual, SLOs are numbers. With it, they become a decision-making tool.
The methodology is the same regardless of your stack. Here's the thinking process using a concrete example: the POST /orders endpoint of a REST API.
Write this in plain English before touching any tooling.
Journey: A registered user places an order.
Critical user interaction: The user submits the order form and expects a confirmation promptly.
The instrumentation follows the story, not the other way around.
A good SLI combines correctness (did it succeed?) and speed (was it fast enough?).
SLI:
The percentage of POST /orders requests that return a 2xx HTTP status code within 500 milliseconds.
SLI = (requests returning 2xx in < 500ms) / (total requests) × 100
Both dimensions matter. A request that returns 200 after 8 seconds has failed the user even if it's technically "successful."
Measure your baseline before committing to a number. Query your actual success rate over the past 30 days. Set the SLO slightly below that, creating accountability without demanding perfection.
Example target: 99.5% over a rolling 30-day window.
Error Budget = 1 - 0.995 = 0.005 = 0.5%
30-day window = 43,200 minutes
Allowable bad = 43,200 × 0.005 = 216 minutes (~3h 36m)
Write this number down and keep it visible. A 20-minute incident consumes ~9% of the monthly budget. A 2-hour incident consumes ~55%. When the budget is concrete, the cost of incidents becomes concrete too.
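To make the incident cost concrete at review time, express each incident as a share of the 216-minute budget. A minimal sketch:

```python
# Share of the monthly error budget consumed by a single full-outage incident.
# Assumes the incident is a total outage; a partial degradation burns proportionally less.
def budget_consumed_pct(incident_minutes: float, budget_minutes: float = 216.0) -> float:
    return round(100.0 * incident_minutes / budget_minutes, 1)

print(budget_consumed_pct(20))   # 9.3  -> a 20-minute incident eats ~9% of the month
print(budget_consumed_pct(120))  # 55.6 -> a 2-hour incident eats over half
```

Framing incidents this way turns "we were down for two hours" into "we spent half the month's reliability budget in one afternoon."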
Once you have the SLI definition, target, and error budget worked out, the next step is getting it into your observability stack. This means writing the SQL query that evaluates the SLI, building the compliance dashboard, and setting up scheduled alerts that fire when the target is at risk.
The OpenObserve blog has a complete walkthrough using a real-world login latency scenario that maps directly to this process:
SLO-Based Alerting in OpenObserve →
Burn rate is how fast you're consuming your error budget relative to your expected pace.
Burn Rate = (current error rate) / (1 - SLO target)
| Burn Rate | What it means | Budget exhausted in |
|---|---|---|
| 1× | On pace , normal | 30 days (end of window) |
| 2× | Draining faster than expected | ~15 days |
| 6× | Significant degradation | ~5 days |
| 14× | Major incident / outage | ~2 days |
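The formula and the table above can be checked with a few lines; the 7% error rate here is an illustrative input chosen to produce the 14x fast-burn case for a 99.5% SLO.

```python
# Burn rate: current error rate relative to the rate the budget allows.
def burn_rate(current_error_rate: float, slo_target: float) -> float:
    return current_error_rate / (1 - slo_target)

# At burn rate r, a window_days budget lasts window_days / r days.
def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate

rate = burn_rate(0.07, 0.995)              # 7% errors against a 0.5% budget
print(round(rate, 1))                      # 14.0
print(round(days_to_exhaustion(rate), 1))  # 2.1 -> matches the "~2 days" row
```

A burn rate of exactly 1.0 corresponds to an error rate equal to the budget (0.5% here), exhausting it precisely at the end of the 30-day window.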
If you alert on a 5-minute window alone, transient spikes will page your on-call for something that self-resolved 90 seconds later. If you alert only on a 6-hour window, a full outage won't fire for hours.
The solution is multi-window alerting: alert only when burn rate is elevated in both a short and a long window simultaneously. The short window confirms it's real; the long window confirms it's sustained.
| Window Pair | Catches | Urgency |
|---|---|---|
| 5m + 1h | Fast burn (outage) | Page immediately |
| 30m + 6h | Medium burn (degradation) | Page within business hours |
| 2h + 24h | Slow burn (subtle regression) | Slack / ticket |
The two-window requirement is noise suppression by design. A 2-minute spike should not wake someone at 3am. A 2-hour sustained burn rate of 14× absolutely should.
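The two-window rule is a simple conjunction: both windows must show an elevated burn before anyone is paged. A sketch, using the 14x fast-burn threshold from the table (thresholds are tunable per window pair):

```python
# Multi-window burn-rate alert: fire only when BOTH the short and the long
# window show an elevated burn. The short window confirms it's real;
# the long window confirms it's sustained.
def should_page(short_window_burn: float, long_window_burn: float,
                threshold: float = 14.0) -> bool:
    return short_window_burn >= threshold and long_window_burn >= threshold

print(should_page(20.0, 15.0))  # True  -> sustained fast burn, page
print(should_page(25.0, 1.2))   # False -> 2-minute spike, stay quiet
```

The medium and slow tiers are the same predicate with lower thresholds and longer window pairs, routed to less urgent channels.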
OpenObserve's scheduled alerts, which evaluate SQL queries over configurable time windows, map directly to this multi-window model. The following guides cover the alert setup, query structure, and notification routing:
SLO-Driven Monitoring: Build Better Alerts with OpenObserve →
Alerting 101: From Concept to Demo →
An SLO that exists only in a config file is not an SLO; it's a query. For SLOs to actually change team behavior, they need to be visible constantly.
1. Error Budget Remaining (Gauge)
Shows at a glance how much budget is left for the current 30-day window. Color-code it: green above 50%, yellow at 10–50%, red below 10%. This should be the first thing your team sees at standup, not the deploy count, not the ticket velocity.
2. Burn Rate Over Time (Time Series)
A line chart of current burn rate over the past 7 days. Overlay reference lines at 1× (expected pace), 2× (slow burn threshold), and 14× (fast burn / page threshold). At a glance, your team should be able to see whether the week was clean or problematic.
3. SLI Trend Over Time (Time Series)
Daily SLI values plotted over 30 days with a reference line at the SLO target. This tells you whether reliability is improving, degrading, or flat over time, which is the question the monthly review should be anchored to.
For the dashboard build (panel creation, SQL queries, variable templating, and auto-refresh configuration), the OpenObserve documentation covers exactly this:
Building Monitoring Dashboards with OpenObserve →
Observability Dashboards: How to Build Them and What to Show →
The one habit that turns an SLO dashboard from decoration into a decision-making tool: make the error budget gauge the first screen shared at every standup.
When it's green, the conversation is five seconds: "Budget's healthy, let's ship." When it's yellow or red: "What happened, who owns the fix, should we pause the release queue?"
That's the entire promise of SLOs: not a prettier dashboard, but a mechanism for making reliability a first-class engineering decision, visible and discussable, every single day.
You'll know your SLOs are actually working when the error budget, rather than intuition, starts driving the decision to ship or to pause.
The tools are straightforward. The discipline is the hard part.