How to Actually Set Meaningful SLOs (Most Teams Are Doing It Wrong)



Let's start with a confession most engineering teams won't make out loud: the majority of SLOs in production today are theater.
Here's the scene. A manager reads a Google SRE blog post on a Sunday evening. Monday morning, they ask the team to "implement SLOs by end of sprint." By Friday, someone has wired up a Prometheus alert on CPU uptime and called it availability: 99.9%. The metric goes on the dashboard. The dashboard goes on the TV in the office. The TV never turns red.
Meanwhile, users are seeing 10-second load times, checkout flows that silently fail, and API errors that return HTTP 200 with a JSON body that says "status": "error". The infrastructure is healthy. The users are not.
This is the central dysfunction: teams confuse infrastructure health with user experience. CPU uptime, memory usage, pod restarts: these are signals that something might be wrong with your systems. They are not signals that something is definitely wrong for your users. A server can have 15% CPU utilization and still be returning 500s to every request.
Real SLOs are built from the user's perspective. They answer one question: "Is this product doing what the user needs it to do, fast enough and reliably enough to keep their trust?"
Everything else is just a health check with a percentage attached.
Before fixing the anti-patterns, the vocabulary needs to be tight. These three terms are routinely used interchangeably, which is how you end up tracking disk I/O and calling it a reliability target.
An SLI is a quantitative measurement of a specific dimension of service quality, expressed as a ratio of good events to total events.
SLI = (good events / total events) × 100
"Good event" needs a crisp definition. For a REST API, a good event might be: a request that returned a 2xx status code in under 500ms. Every request that timed out, returned a 5xx, or took longer than 500ms is a bad event.
The key word is user-observable. If the user cannot feel it, it probably should not be your primary SLI.
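The good/bad classification above reduces to a few lines of code. The sketch below assumes each request is summarized by an HTTP status and a latency; the 500ms threshold follows the REST API example in the text.

```python
# SLI = (good events / total events) x 100.
# A "good" event here: 2xx status AND latency under 500 ms (the example thresholds).

def is_good(status: int, latency_ms: float) -> bool:
    return 200 <= status < 300 and latency_ms < 500

def sli(requests: list[tuple[int, float]]) -> float:
    good = sum(1 for status, latency in requests if is_good(status, latency))
    return 100.0 * good / len(requests)

requests = [
    (200, 120.0),  # good
    (200, 850.0),  # too slow -> bad
    (503, 90.0),   # server error -> bad
    (201, 430.0),  # good
]
print(sli(requests))  # 50.0
```

Note that a timeout never produces a status code at all; in practice you would count it as a bad event with an infinite latency.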
An SLO is a target value for your SLI over a defined time window.
SLO = SLI target over a rolling window
Example: "99.5% of order placement requests return 2xx in under 500ms, measured over a rolling 30-day window."
The time window matters enormously. A 30-day rolling window is far more useful than a calendar-month window because it gives you a continuous, current picture of reliability, not one that resets on the 1st of the month and gives engineers a false sense of a clean slate.
The error budget is the allowable amount of unreliability derived from your SLO. It's what you get to spend on deployments, experiments, and incidents before you've violated your reliability promise.
Error Budget = 1 - SLO target
If your SLO is 99.5%, your error budget is 0.5%. Over a 30-day window (43,200 minutes), that translates to ~216 minutes of allowable downtime, roughly 3 hours and 36 minutes.
The error budget is the forcing function that makes SLOs change behavior. When the budget is healthy, you ship fast and take risks. When it's draining, you slow down and focus on reliability. This is the tradeoff SLOs formalize (reliability vs. velocity), and without the error budget, you have no mechanism to enforce it.
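The budget arithmetic above is worth keeping as a one-liner, since every incident review ends up redoing it by hand. A minimal sketch:

```python
# Error budget in minutes: window_minutes x (1 - SLO target).
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)

print(round(error_budget_minutes(0.995), 1))  # 216.0 -> ~3h 36m per 30 days
print(round(error_budget_minutes(0.999), 1))  # 43.2  -> under 45 minutes
```

The second line shows why tightening a target from 99.5% to 99.9% is a big deal: the budget shrinks by a factor of five.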
What it looks like:
# This is not an SLO
- name: "CPU Availability"
  metric: avg(cpu_usage_idle) > 0.1
  target: 99.9%
CPU idle percentage, memory utilization, disk I/O: these are leading indicators of potential problems, not measures of user experience. A system can have perfect infrastructure vitals and be silently failing every user request.
What to do instead:
Track HTTP success rate and latency at the application layer, from the user's perspective.
If your SLI cannot be directly felt by a user making a request, it belongs in a runbook, not in an SLO.
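Measuring at the application layer also lets you catch the failure mode from the opening example: an HTTP 200 whose JSON body says `"status": "error"`. A sketch of user-perspective classification, assuming the response body carries a `status` field (the field name is illustrative):

```python
import json

# A good event from the user's perspective: 2xx transport status, fast enough,
# AND no application-level error hiding inside a "successful" response.
def is_good_event(http_status: int, latency_ms: float, body: str) -> bool:
    if not (200 <= http_status < 300):
        return False
    if latency_ms >= 500:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return True  # non-JSON body: judge by status and latency alone
    return payload.get("status") != "error"

print(is_good_event(200, 120, '{"status": "ok"}'))     # True
print(is_good_event(200, 120, '{"status": "error"}'))  # False: healthy infra, failed user
```

An infrastructure-level probe would count both of these requests as successes.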
What it looks like: A team sets 99.99% availability on their first SLO because it "sounds right" for a production system.
99.99% allows only 4.38 minutes of downtime per month. Unless you have fully automated rollback, zero-downtime deployments, multi-region failover, and a battle-hardened incident process, you will blow this budget with a single botched deploy. And once the budget is gone, the SLO becomes a number everyone quietly ignores.
Tight SLOs that are routinely missed train your team to dismiss SLOs entirely.
What to do instead:
Start by measuring your current reliability. Run your SLI query over the past 30 days and see what you're actually achieving. If you're at 98.7%, setting an SLO of 99.9% is aspirational fantasy. Set it at 98.4%, just under current performance, to create accountability without setting the team up to fail.
The SLO is not a marketing promise. It is an engineering contract. Set it at a level your team can defend, then tighten it as you improve.
Rule of thumb: Start with current_baseline - 0.3% as your initial SLO, and revisit quarterly.
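The rule of thumb is trivial to encode, which makes it easy to apply consistently across services during the quarterly revisit:

```python
# Rule of thumb from the text: initial SLO = measured 30-day baseline minus 0.3 points.
def initial_slo(baseline_pct: float, margin_pct: float = 0.3) -> float:
    return round(baseline_pct - margin_pct, 2)

print(initial_slo(98.7))   # 98.4
print(initial_slo(99.95))  # 99.65
```

The margin is a starting point, not a law; a service with highly variable traffic may want a wider one.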
What it looks like: Teams alert on the SLO breach itself: a PagerDuty notification only fires when reliability drops below the target. By the time that fires, the error budget may already be 80% consumed.
What to do instead:
Alert on how fast you're consuming the error budget, not on whether you've already breached the final target. A burn rate of 1.0 means you're on pace to exhaust the budget exactly at month-end. A burn rate of 14.0 means you'll exhaust it in roughly 2 days.
This gives you early warning and proportional urgency: a page for fast burns (outages), a Slack alert for slow burns (quiet regressions). The full multi-window mechanics are in Section 5.
What it looks like:
payment-service: availability 99.9%
cart-service: availability 99.9%
inventory-service: availability 99.9%
user-service: availability 99.9%
This looks comprehensive. It is not. A user completing a checkout flow touches all four services. If each has 99.9% availability independently, the joint availability of the checkout flow is:
0.999 × 0.999 × 0.999 × 0.999 ≈ 99.6%
You have four green dashboards and a user journey that's failing 1 in 250 times, and no single service owner sees it as their problem.
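The compounding is easy to verify: for services in series, per-service availabilities multiply, so the journey is always less reliable than its weakest link suggests.

```python
# Joint availability of a serial user journey: per-service availabilities multiply.
def journey_availability(service_availabilities: list[float]) -> float:
    result = 1.0
    for a in service_availabilities:
        result *= a
    return result

joint = journey_availability([0.999] * 4)
print(round(joint, 5))          # 0.99601 -> ~99.6% for the four-service checkout
print(round(1 / (1 - joint)))   # 250 -> fails roughly 1 request in 250
```

Add a fifth 99.9% service to the flow and the journey drops to ~99.5%, silently consuming the entire error budget of a 99.5% SLO.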
What to do instead:
Define SLOs for user journeys, not service instances. The checkout flow is one SLO. The authentication flow is one SLO. Search-and-discovery is one SLO.
Map what users do in your system first. Then instrument it. The service topology is an implementation detail; the user journey is the contract.
What it looks like: The error budget is calculated once, put on a dashboard, and never discussed again. Teams ship features. Incidents happen. The budget drains. Nobody changes behavior because nobody is looking.
What to do instead:
Run a monthly SLO review: a 30-minute ritual with a fixed agenda.
Without this ritual, SLOs are numbers. With it, they become a decision-making tool.
The methodology is the same regardless of your stack. Here's the thinking process using a concrete example: the POST /orders endpoint of a REST API.
Write this in plain English before touching any tooling.
Journey: A registered user places an order.
Critical user interaction: The user submits the order form and expects a confirmation promptly.
The instrumentation follows the story, not the other way around.
A good SLI combines correctness (did it succeed?) and speed (was it fast enough?).
SLI:
The percentage of POST /orders requests that return a 2xx HTTP status code within 500 milliseconds.
SLI = (requests returning 2xx in < 500ms) / (total requests) × 100
Both dimensions matter. A request that returns 200 after 8 seconds has failed the user even if it's technically "successful."
Measure your baseline before committing to a number. Query your actual success rate over the past 30 days. Set the SLO slightly below that, creating accountability without demanding perfection.
Example target: 99.5% over a rolling 30-day window.
Error Budget = 1 - 0.995 = 0.005 = 0.5%
30-day window = 43,200 minutes
Allowable bad = 43,200 × 0.005 = 216 minutes (~3h 36m)
Write this number down and keep it visible. A 20-minute incident consumes ~9% of the monthly budget. A 2-hour incident consumes ~55%. When the budget is concrete, the cost of incidents becomes concrete too.
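To make the incident cost concrete at review time, express each incident as a share of the 216-minute budget. A minimal sketch:

```python
# Share of the monthly error budget consumed by a single full-outage incident.
# Assumes the incident is a total outage; a partial degradation burns proportionally less.
def budget_consumed_pct(incident_minutes: float, budget_minutes: float = 216.0) -> float:
    return round(100.0 * incident_minutes / budget_minutes, 1)

print(budget_consumed_pct(20))   # 9.3  -> a 20-minute incident eats ~9% of the month
print(budget_consumed_pct(120))  # 55.6 -> a 2-hour incident eats over half
```

Framing incidents this way turns "we were down for two hours" into "we spent half the month's reliability budget in one afternoon."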
Once you have the SLI definition, target, and error budget worked out, the next step is getting it into your observability stack. This means writing the SQL query that evaluates the SLI, building the compliance dashboard, and setting up scheduled alerts that fire when the target is at risk.
The OpenObserve blog has a complete walkthrough using a real-world login latency scenario that maps directly to this process:
SLO-Based Alerting in OpenObserve →
Burn rate is how fast you're consuming your error budget relative to your expected pace.
Burn Rate = (current error rate) / (1 - SLO target)
| Burn Rate | What it means | Budget exhausted in |
|---|---|---|
| 1× | On pace , normal | 30 days (end of window) |
| 2× | Draining faster than expected | ~15 days |
| 6× | Significant degradation | ~5 days |
| 14× | Major incident / outage | ~2 days |
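The formula and the table above can be checked with a few lines; the 7% error rate here is an illustrative input chosen to produce the 14x fast-burn case for a 99.5% SLO.

```python
# Burn rate: current error rate relative to the rate the budget allows.
def burn_rate(current_error_rate: float, slo_target: float) -> float:
    return current_error_rate / (1 - slo_target)

# At burn rate r, a window_days budget lasts window_days / r days.
def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate

rate = burn_rate(0.07, 0.995)              # 7% errors against a 0.5% budget
print(round(rate, 1))                      # 14.0
print(round(days_to_exhaustion(rate), 1))  # 2.1 -> matches the "~2 days" row
```

A burn rate of exactly 1.0 corresponds to an error rate equal to the budget (0.5% here), exhausting it precisely at the end of the 30-day window.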
If you alert on a 5-minute window alone, transient spikes will page your on-call for something that self-resolved 90 seconds later. If you alert only on a 6-hour window, a full outage won't fire for hours.
The solution is multi-window alerting: alert only when burn rate is elevated in both a short and a long window simultaneously. The short window confirms it's real; the long window confirms it's sustained.
| Window Pair | Catches | Urgency |
|---|---|---|
| 5m + 1h | Fast burn (outage) | Page immediately |
| 30m + 6h | Medium burn (degradation) | Page within business hours |
| 2h + 24h | Slow burn (subtle regression) | Slack / ticket |
The two-window requirement is noise suppression by design. A 2-minute spike should not wake someone at 3am. A 2-hour sustained burn rate of 14× absolutely should.
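The two-window rule is a simple conjunction: both windows must show an elevated burn before anyone is paged. A sketch, using the 14x fast-burn threshold from the table (thresholds are tunable per window pair):

```python
# Multi-window burn-rate alert: fire only when BOTH the short and the long
# window show an elevated burn. The short window confirms it's real;
# the long window confirms it's sustained.
def should_page(short_window_burn: float, long_window_burn: float,
                threshold: float = 14.0) -> bool:
    return short_window_burn >= threshold and long_window_burn >= threshold

print(should_page(20.0, 15.0))  # True  -> sustained fast burn, page
print(should_page(25.0, 1.2))   # False -> 2-minute spike, stay quiet
```

The medium and slow tiers are the same predicate with lower thresholds and longer window pairs, routed to less urgent channels.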
OpenObserve's scheduled alerts, which evaluate SQL queries over configurable time windows, map directly to this multi-window model. The following guides cover the alert setup, query structure, and notification routing:
SLO-Driven Monitoring: Build Better Alerts with OpenObserve →
Alerting 101: From Concept to Demo →
An SLO that exists only in a config file is not an SLO; it's a query. For SLOs to actually change team behavior, they need to be visible constantly.
1. Error Budget Remaining (Gauge)
Shows at a glance how much budget is left for the current 30-day window. Color-code it: green above 50%, yellow at 10–50%, red below 10%. This should be the first thing your team sees at standup, not the deploy count, not the ticket velocity.
2. Burn Rate Over Time (Time Series)
A line chart of current burn rate over the past 7 days. Overlay reference lines at 1× (expected pace), 2× (slow burn threshold), and 14× (fast burn / page threshold). At a glance, your team should be able to see whether the week was clean or problematic.
3. SLI Trend Over Time (Time Series)
Daily SLI values plotted over 30 days with a reference line at the SLO target. This tells you whether reliability is improving, degrading, or flat over time, which is the question the monthly review should be anchored to.
For the dashboard build (panel creation, SQL queries, variable templating, and auto-refresh configuration), the OpenObserve documentation covers exactly this:
Building Monitoring Dashboards with OpenObserve →
Observability Dashboards: How to Build Them and What to Show →
The one habit that turns an SLO dashboard from decoration into a decision-making tool: make the error budget gauge the first screen shared at every standup.
When it's green, the conversation is five seconds: "Budget's healthy, let's ship." When it's yellow or red: "What happened, who owns the fix, should we pause the release queue?"
That's the entire promise of SLOs: not a prettier dashboard, but a mechanism for making reliability a first-class engineering decision, visible and discussable, every single day.
You'll know your SLOs are actually working when the error budget, rather than intuition, starts driving the decision to ship or to pause.
The tools are straightforward. The discipline is the hard part.