Our Alerts Are Noise: How Do We Actually Fix Alert Fatigue?



You're on-call. It's 2 AM. Your phone buzzes. Again. 47 alerts in the last hour. You've checked the dashboards three times. Everything looks… fine. So you silence the alerts and go back to sleep. Two hours later, a real incident slips through.
That's not a hypothetical. That's Tuesday.
Alert fatigue is one of the most quietly destructive problems in modern engineering teams. It doesn't show up in a post-mortem. It accumulates, alert by alert, ignored page by ignored page, until your on-call rotation becomes a coin flip between panic and apathy.
This article breaks down why alert fatigue happens, how SLO-based alerting changes the game, and the exact steps to set up your first burn rate alert today.
Before we fix anything, let's be honest about how teams end up here.
When you spin up a new tool (a cloud provider, an APM platform, a logging solution), it comes with a bundle of pre-built alerts. CPU > 80%. Memory > 70%. Error rate > 1%. Teams click "Enable All" and ship it.
These defaults are built to cover every possible customer. They are not built for your service, your traffic patterns, or your users. A batch-processing service that regularly spikes to 90% CPU during a scheduled job will page your team every night for something that is entirely expected behavior.
Alerts are easy to create and painful to maintain. Everyone adds them; nobody removes them. There's no process, no owner, no review cycle. Six months later you have 300 alerts, half of which were written by someone who left the company.
Without ownership, alerts become archaeological artifacts: each one a mystery, none of them trusted.
The most common mistake: alerting on system metrics instead of user-facing symptoms.
Your database CPU is at 85%. Should you page someone? Maybe. But the real question is: are users actually experiencing degraded performance right now? If response times are normal and error rates are flat, the answer is probably no.
System metrics are leading indicators at best, false alarms at worst. What you actually care about is whether your users are having a bad time.
Meet Priya. She's on-call this Thursday. Here's her shift:
| Metric | Count |
|---|---|
| Total alerts fired | 200 |
| Alerts requiring action | 10 |
| True incidents | 2 |
| False positives / noise | 190 |
Priya spends 4 hours investigating alerts that resolve themselves. She misses a slow memory leak that builds over 3 hours (it never crossed any single threshold hard enough to fire a critical alert), and by morning, a service is down.
This is what threshold alerting looks like at scale. The signal is buried in noise. The team burns out. Trust in the alerting system collapses.
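The arithmetic behind that shift is worth spelling out. A quick sketch, using only the numbers from the table above:

```python
# Signal-to-noise math for Priya's shift, using the table above.
total_alerts = 200
actionable = 10
true_incidents = 2

precision = actionable / total_alerts  # fraction of alerts worth acting on
print(f"actionable: {precision:.0%}")                          # actionable: 5%
print(f"true incidents: {true_incidents / total_alerts:.0%}")  # true incidents: 1%
```

Nineteen out of every twenty pages were noise. No one stays responsive to a channel like that.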
Google's Site Reliability Engineering book introduced a framework that fundamentally reframes how we think about alerts: only page someone when it matters to the user.
The machinery behind it is surprisingly approachable.
**Service Level Indicator (SLI):** A specific, measurable signal of user experience. The most common: the proportion of requests that are successful and fast.
SLI = (good requests) / (total requests)
**Service Level Objective (SLO):** Your target reliability over a rolling window. Example: "99.9% of requests succeed over a 30-day window."
**Error Budget:** The flip side of your SLO. At 99.9% availability, you are allowed 0.1% failure: that's ~43.2 minutes of downtime per 30-day window. This is your error budget. Spend it intentionally (deploys, experiments) or protect it aggressively (incidents, regressions).
**Burn Rate:** How fast you are consuming your error budget right now, compared to the sustainable rate.
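To make the arithmetic concrete, here is a minimal sketch of the budget and burn-rate math, assuming the 99.9% / 30-day example above (the 1.44% error rate is a hypothetical input):

```python
# Error budget and burn rate for a 99.9% SLO over a 30-day window.
SLO = 0.999
WINDOW_HOURS = 30 * 24   # 720 hours in the rolling window
error_budget = 1 - SLO   # 0.1% of requests may fail

def burn_rate(observed_error_rate):
    """How many times the sustainable spend rate we are burning at."""
    return observed_error_rate / error_budget

def hours_to_exhaustion(rate):
    """At a constant burn rate, hours until the whole budget is gone."""
    return WINDOW_HOURS / rate

# A 1.44% observed error rate against a 0.1% budget is a 14.4x burn:
print(round(burn_rate(0.0144), 1))          # 14.4
print(round(hours_to_exhaustion(14.4), 1))  # 50.0 hours of budget left
```

A burn rate of exactly 1 means you would spend the entire budget in exactly 30 days: the break-even pace.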
The core principle: Don't alert on whether something is broken. Alert on whether you're burning through your error budget faster than you can afford.
One of the most elegant ideas in Google's SRE Workbook is combining a short window and a long window burn rate check in the same alert.
| Window | Purpose |
|---|---|
| Short (5 minutes) | Catches sudden, severe outages quickly. High sensitivity. |
| Long (1 hour) | Confirms the burn is real and sustained, not a 30-second blip. |
Both conditions must be true to fire the alert.
This eliminates two common failure modes:
- A short window alone is too twitchy: it pages on every transient spike, even ones that self-resolve in seconds.
- A long window alone is too slow: it fires late, and keeps firing long after the problem is already fixed.
The result: fewer alerts, higher confidence, less time wasted.
Here's a practical walkthrough. We'll use a web API as the example service.
Pick the metric that best represents "is the user having a good experience?"
For most HTTP services:
SLI = successful_requests / total_requests
Where "successful" typically means: HTTP status not 5xx and latency under your target (e.g., p99 < 500ms).
In OpenObserve, you can express this as a metric query or a log-based aggregation over your ingested data.
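As a sketch of the classification logic only (the field names and the 500 ms target are illustrative, not a specific OpenObserve schema):

```python
# Hypothetical request records; "good" = non-5xx AND under the latency target.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 640},  # succeeded, but too slow: bad
    {"status": 503, "latency_ms": 90},   # fast, but 5xx: bad
    {"status": 201, "latency_ms": 300},
]

def is_good(req, latency_target_ms=500):
    return req["status"] < 500 and req["latency_ms"] < latency_target_ms

sli = sum(is_good(r) for r in requests) / len(requests)
print(sli)  # 0.5 -> 2 of 4 requests were "good"
```

Note that a request must be both successful and fast to count: slow successes burn budget just like errors do.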
Start conservative. For most internal services, 99.5% over 30 days is a reasonable starting point. For customer-facing APIs, consider 99.9%.
SLO target: 99.9%
Error budget: 0.1% = ~43.2 minutes per 30-day window
Document this. Put it somewhere your whole team can see it. Your SLO is meaningless if only one person knows what it is.
For a page-worthy alert (wake someone up):
Burn rate threshold: 14.4×
At this rate, 30-day budget burns in ~50 hours
For a warning / ticket alert (low urgency, fix during business hours):
Burn rate threshold: 6×
At this rate, 30-day budget burns in ~5 days
These numbers come directly from Google's recommended alerting thresholds and are a solid default for most teams.
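To see what those multipliers mean in practice: the raw error-rate cutoff in an alert expression is just the burn-rate multiplier times the error budget fraction. A quick sketch for the 99.9% SLO example:

```python
# burn-rate multiplier x error-budget fraction = raw error-rate threshold
ERROR_BUDGET = 0.001  # from the 99.9% SLO

for multiplier, action in [(14.4, "page on-call"), (6.0, "open a ticket")]:
    threshold = multiplier * ERROR_BUDGET
    print(f"{action}: fire when error rate > {threshold:.2%}")
# page on-call: fire when error rate > 1.44%
# open a ticket: fire when error rate > 0.60%
```

That 1.44% is the `14.4 * 0.001` you'll see written directly into alert expressions.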
```yaml
# Example: Prometheus / OpenObserve alert rule
alert: HighErrorBudgetBurnRate
expr: |
  (
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum by (service) (rate(http_requests_total[5m]))
  ) > (14.4 * 0.001)  # 14.4x burn rate on the short window
  and
  (
    sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum by (service) (rate(http_requests_total[1h]))
  ) > (14.4 * 0.001)  # 14.4x burn rate on the long window
for: 2m
labels:
  severity: critical
annotations:
  summary: "High error budget burn rate on {{ $labels.service }}"
  description: >
    Error budget is burning at >14.4x the sustainable rate.
    At this pace, the 30-day budget will be exhausted in ~50 hours.
    Runbook: https://wiki.yourcompany.com/runbooks/high-error-budget-burn
```
💡 If you're using OpenObserve, you can configure SLO-based alerts directly in the Alerts UI without writing PromQL from scratch. The platform supports both log-based and metrics-based SLIs. See the SLO-based Alerting Guide for OpenObserve.
Before this alert goes into production rotation, verify it end to end:
- Replay a past incident (or inject synthetic errors in staging) and confirm the alert fires.
- Confirm it routes to the right destination and that the runbook link resolves.
- Confirm it stays quiet through normal traffic, including known spikes.
An untested alert is a surprise waiting to happen at 3 AM.
You don't need to rebuild everything overnight. Start here.
No runbook, no alert. Full stop.
A runbook doesn't have to be a 10-page wiki. It needs to answer three questions:
- What does this alert actually mean? (What is degraded, and for whom?)
- What should I check first?
- How do I mitigate it, or who do I escalate to?
If you can't answer those three questions, you don't understand the alert well enough to be paging someone about it. Link the runbook directly in the alert annotation. The on-call engineer shouldn't have to go hunting at 3 AM.
Not everything is an emergency. Route accordingly:
| Urgency | Action |
|---|---|
| Critical (user impact now) | Page on-call immediately |
| Warning (budget burning slowly) | Create a ticket, fix during business hours |
| Info (trend to watch) | Log it, review in weekly meeting |
The moment you start paging people for things that don't need immediate action, they stop trusting pages that do.
Put it in the calendar. Once a quarter, your team reviews every active alert and asks:
- Has this alert fired in the last quarter?
- When it fired, did anyone take action?
- Are the threshold and the runbook still accurate?
Make someone responsible for this. Without an owner, the review never happens.
If an alert has fired (or could have fired) and no one acted on it in six months, it's not useful; it's noise. Archive it. Delete it. Let it go.
This one will feel scary. Do it anyway. You can always recreate an alert. You can't get back the trust your team lost by crying wolf for six months.
A common anti-pattern: every service routes all its alerts to its own Slack channel and PagerDuty rotation. The result is that each team gets flooded with everything, regardless of severity.
Route alerts based on what action they require:
- Critical alerts → on-call PagerDuty rotation (immediate wake-up)
- Warning alerts → #alerts-warning Slack channel (next business day)
- Info alerts → #observability-metrics Slack channel (weekly review)
This makes severity meaningful. When someone gets paged, they know it's real.
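The routing above boils down to a simple severity-to-destination mapping. A minimal sketch (the destination strings are placeholders, not a real PagerDuty or Slack integration):

```python
# Placeholder destinations; wire these to your real paging/chat integrations.
ROUTES = {
    "critical": "pagerduty:on-call",         # immediate wake-up
    "warning": "slack:#alerts-warning",      # next business day
    "info": "slack:#observability-metrics",  # weekly review
}

def route(alert):
    # Unknown severities go to a human-reviewed channel instead of
    # silently paging (or silently dropping) the alert.
    return ROUTES.get(alert.get("severity"), "slack:#alerts-triage")

print(route({"severity": "critical"}))  # pagerduty:on-call
print(route({"severity": "debug"}))     # slack:#alerts-triage
```

The design point is the fallback: an unclassified alert is a process bug, and the fallback channel makes it visible without waking anyone up.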
Alert fatigue isn't a tool problem or a vendor problem. It's a philosophy problem.
Threshold alerting asks: "Did something cross a line?" SLO-based alerting asks: "Is the user experience degrading faster than we can afford?" That shift in framing changes everything: the alerts you write, the thresholds you set, and the conversations you have with your team and your stakeholders.
Your one action this week: Pick one service. Define a single SLI. Set a 30-day SLO. Write one dual-window burn rate alert with a runbook. Retire three threshold alerts you don't trust. That's it. You don't have to boil the ocean. You just have to start.
Start a Free Trial of OpenObserve and set up SLO-based alerting without the infrastructure overhead.