SLO-Based Alerting in OpenObserve

For SREs and developers, here's what you'll learn.
Monitoring tools often throw hundreds of alerts for things like high CPU, slow responses, and disk usage. But most of these don't answer the real question: "Is the user experience impacted?"
That’s what Service Level Objectives (SLOs) are for.
SLOs let you focus on reliability goals tied to real user expectations, not just infrastructure signals. Instead of reacting to everything that could go wrong, you set targets for what must go right, based on your business and real user expectations.
A service is anything your users rely on, like your website, login system, billing API, or background jobs. Each one is expected to function reliably.
To measure how reliably a service performs, we use Service Level Indicators (SLIs). These are metrics like request latency, error rate, and availability.
An SLO is a target you set for an SLI over a time window. For example, "99.9% of requests should succeed over 7 days."
SLOs are reliability targets that guide operational focus. Instead of chasing infrastructure metrics like CPU or memory usage, SLOs help you focus on what matters: whether the service is available, responsive, and not throwing errors.
For example, your database service might show high CPU usage. That alone doesn’t matter to the service owner unless users are seeing slow queries or failed transactions. SLOs let you ignore noisy alerts and focus only when actual user impact is at risk.
Understanding Error Budgets
Every SLO implies an error budget: the small, acceptable margin for failure. For example, if your SLO is 99.9%, you have an error budget of 0.1% failures in a given time window; if the service handles 1,000,000 requests in that window, about 1,000 of them can fail before the SLO is breached.
This error budget isn't just for alerts. It's a decision-making tool: you alert not because the SLO is already breached, but when you're burning through the error budget too fast.
Note: An SLO is not an alert by itself; it's a long-term reliability target. Alerts are derived from how fast or how often you're deviating from that SLO.
Let’s see how SLOs play out with a real-world example.
Before we dive into the demo, make sure you have an OpenObserve instance set up with trace data flowing into it.
The scenario: Your users are complaining about slow login experiences. Let's build a complete SLO monitoring system in OpenObserve that tracks, alerts, and helps debug latency issues.
We'll walk through the entire workflow:
Define SLO → Build Dashboard → Create Alerts → Debug Issues when at risk.
Where to Define SLOs in OpenObserve?
OpenObserve doesn't have a built-in SLO object. You define an SLO by writing a SQL query that evaluates whether your service is meeting the target. You can then visualize that query in a dashboard and attach alert conditions to it.
In short, the SLO lives in the query and the alert logic you create, giving you full control over how it's defined and enforced.
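For instance, the request-success SLO from earlier ("99.9% of requests should succeed over 7 days") could be expressed as a query sketch like the one below. The stream name `http_requests` and the `http_status_code` field are placeholders for whatever your own data uses, and the 7-day window would come from the dashboard or alert time range rather than the SQL itself:

```sql
-- Sketch: success-rate SLI; the SLO is met while success_rate_pct stays at or above 99.9
SELECT
  service_name,
  100.0 * SUM(CASE WHEN http_status_code < 500 THEN 1 ELSE 0 END) / COUNT(*) AS success_rate_pct
FROM http_requests
GROUP BY service_name
```

The rest of this post applies the same pattern to a latency SLO for a login endpoint.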
Business Context: Users expect login to feel responsive. Research shows anything above 500ms feels sluggish.
Sample trace data (already flowing into OpenObserve):
Note: OpenTelemetry collects traces in a nested JSON structure (e.g., `resource.attributes`, `scopeSpans.spans`). OpenObserve automatically flattens these fields during ingestion for easier querying. The flattening depth is controlled by the environment variable `ZO_FLATTEN_LEVEL` (default: 3). That's why the example below shows simple keys like `service_name` instead of deeply nested ones.
```json
{
  "trace_id": "abc123abc123abc123abc123abc123ab",
  "span_id": "def456def456def4",
  "operation_name": "POST /login",
  "start_time": 1754664691403452700,
  "end_time": 1754664691409971500,
  "duration": 518600,
  "http_method": "POST",
  "http_status_code": 200,
  "http_url": "http://authservice.local/login",
  "service_name": "auth-service",
  "span_kind": 3,
  "span_status": "UNSET",
  "status_code": 0,
  "status_message": ""
}
```
SLO: "95th percentile login response time should stay under 500ms over any 7-day rolling window"
(Note: This threshold is just an example. SLOs vary based on business impact, user expectations, and service risk tolerance.)
Why P95? Averages hide problems. P95 tells us that 95% of users get a response faster than this threshold, catching the tail latency issues that affect real users. For example, if 90 out of 100 requests take 100 ms and 10 take 2 s, the average is 290 ms and looks healthy, but the P95 is 2 s, exposing the slow experience some users actually get.
Now you need to verify whether your service is meeting its defined performance target.
In the OpenObserve UI, go to the Stream section, select your traces stream, and click the Explore icon.
Use the Include term icon to filter for the relevant service, operation name, and status code.
After filtering, use a SQL function to calculate the 95th percentile of response time and check whether the target is met.
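As a rough sketch, that query might look like the following; replace `<STREAM_NAME>` with your traces stream, and adjust the conversion if your `duration` field isn't in microseconds:

```sql
-- Sketch: P95 login latency (in ms) for successful requests on auth-service
SELECT
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) AS p95_latency
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
```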
Note:
- We filter on `status_code < 400` because we want to consider only successful or redirected requests (i.e., not client or server errors) when evaluating latency SLOs.
- The `duration` field is in microseconds in this case, so we divide by 1000 to convert the result to milliseconds for comparison with the 500 ms SLO target.

Here, p95_latency is below 500 ms, so our SLO target is not at risk. But do we need to rewrite the query every time to check compliance?
Well, no. Instead, we can create dashboards.
Creating a dashboard helps visualize and observe SLO compliance trends over time without repeating manual checks.
Navigate to OpenObserve → Dashboard and create a new dashboard.
Next, add Panels to your dashboard.
## Login Latency SLO
- **Target**: P95 < 500ms over 7 day rolling window
- **Error Budget**: 5% of requests can exceed 500ms
- **Business Impact**: Latency >500ms correlates with 12% higher bounce rate
- **Owner**: Backend Team
- **Escalation**: #backend-oncall
This provides a clear, at-a-glance summary of the SLO, helping teams align on objectives, impact, and ownership.
Create a line chart to capture your latency trends over time:
- Plot `timestamp` on the x-axis and the P95 of response time (`duration`) on the y-axis.
- Filter on `operation_name` and `status_code`, as in the earlier Explore query.
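If you'd rather write the panel query by hand instead of using the chart builder, a sketch along these lines could back the line chart; it assumes OpenObserve's `histogram()` time-bucketing function and a `duration` field in microseconds:

```sql
-- Sketch: hourly P95 login latency (ms) for the line chart
SELECT
  histogram(_timestamp, '1 hour') AS time_bucket,
  APPROX_PERCENTILE_CONT(duration / 1000.0, 0.95) AS p95_latency_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
GROUP BY time_bucket
ORDER BY time_bucket
```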
Create a gauge chart that shows real-time compliance with the latency SLO, so you can quickly see whether the service is meeting its performance target.
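One way to feed that gauge is to compute the percentage of login requests completing under 500 ms; here's a sketch, again assuming `duration` is in microseconds:

```sql
-- Sketch: share of login requests under the 500 ms target, for the gauge
SELECT
  100.0 * SUM(CASE WHEN duration / 1000.0 < 500 THEN 1 ELSE 0 END) / COUNT(*) AS pct_under_500ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
```

If `pct_under_500ms` stays at or above 95, you're within the 5% error budget described in the text panel above.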
Your final dashboard may look something like this:
You can add more panels and charts based on your needs.
Use the Dashboard settings to change the default duration to 7 days, so you don't have to adjust the time range manually every time you visit the dashboard:
Pro tip: Set the dashboard to auto-refresh every 5 minutes so it stays current.
So far, the dashboard tells you what's happening, but someone still has to look at it. That doesn't scale.
To truly defend your SLO, you need alerting that watches the SLO query for you and notifies the team automatically.
Alerts turn SLO breaches into immediate signals, so your team can act before SLAs (Service Level Agreements) or user experience are impacted.
Setting SLO Breach Alert
Navigate to OpenObserve → Alerts → Add Alert
Fill in Alert-Setup details:
Configure the alert settings and select the destination where you want to receive the alert notification.
The Alerts in OpenObserve documentation provides details about the alert parameters, or you can click the `i` icon for a summary.
Next, we need to set the conditions for the alert. Click View Editor to open the SQL query editor and define the alert condition:
```sql
-- P95 login latency for auth-service, used as the alert condition.
-- response_time_ms assumes latency is already in milliseconds; use duration / 1000.0 if your field is in microseconds.
SELECT
  service_name,
  APPROX_PERCENTILE_CONT(response_time_ms, 0.95) AS p95_latency_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
GROUP BY service_name
```
This query checks whether the 95th percentile latency for the `/login` endpoint in `auth-service` exceeds the SLO threshold of 500 ms. (If your duration field is in microseconds, don't forget the conversion to milliseconds.)
Adding `HAVING p95_latency_ms > 500` makes the query return rows only when the SLO is violated, which is exactly what you want for alerting.
You can set the Message Template as:
🚨 Login Latency SLO BREACH
P95: {p95_latency_ms}ms (target: <500ms)
Dashboard: https://your-openobserve.com/login-slo
Note: While threshold-based alerts (like P95 latency > 500 ms) work well, teams aiming for more resilient and user-centric alerting often use burn rate alerts. Burn rate is the speed at which you're consuming your SLO's error budget, and burn-rate alerts can be tuned differently for fast and slow incidents.
For a deeper understanding, see Google's Site Reliability Workbook (Chapter 5: Alerting on SLOs).
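As a rough illustration, a burn-rate style condition for this latency SLO could compare the fraction of slow requests in the alert window against the 5% error budget. The threshold below is illustrative only; in practice you'd follow the workbook's multi-window, multi-burn-rate guidance:

```sql
-- Sketch: burn rate = (fraction of requests over 500 ms) / (5% error budget).
-- A burn rate of 1 consumes the budget exactly on pace; much higher values mean a fast burn.
SELECT
  service_name,
  (SUM(CASE WHEN duration / 1000.0 > 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*)) / 0.05 AS burn_rate
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
  AND status_code < 400
GROUP BY service_name
-- Illustrative alert condition: fire only when the budget is burning far faster than planned
HAVING (SUM(CASE WHEN duration / 1000.0 > 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*)) / 0.05 > 10
```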
Once an alert fires, you need to figure out why.
Focus on these four angles:
Pro tip: Use drilldown to link to dashboards/logs for quick triage when an alert hits.
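For example, a quick triage query to pull the slowest recent login traces (same hypothetical stream and fields as above) might look like this:

```sql
-- Sketch: ten slowest login requests, with trace IDs for drilldown
SELECT
  trace_id,
  http_status_code,
  duration / 1000.0 AS duration_ms
FROM <STREAM_NAME>
WHERE
  service_name = 'auth-service'
  AND operation_name = 'POST /login'
ORDER BY duration DESC
LIMIT 10
```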
SLOs aren't just theory; they're your best defense against alert fatigue.
By focusing on what users actually care about, and combining that with OpenObserve's SQL-powered alerting, you can cut alert noise and act on real user impact before SLAs are breached.
OpenObserve gives you the flexibility to express these goals in code and turn them into actionable alerts.
Get Started with OpenObserve Today!
Sign up for a 14-day cloud trial. Check out our GitHub repository for self-hosting and contribution opportunities.