RED Metrics: Monitoring Requests, Errors, and Latency for Microservices



If you are running microservices, you already know how quickly things can go wrong. A small spike in latency, a sudden dip in traffic, or a silent error storm can break user experience long before dashboards catch up. This is why the RED Metrics framework has become a foundational approach for SREs and developers who need clear, fast signals without drowning in noise.
In this article, we’ll explore what RED metrics are, why they matter, how to use them in real-world troubleshooting, and how to implement them in OpenObserve using practical SQL queries and dashboards.
RED stands for Requests, Errors, and Duration: three core indicators that describe how well a service is performing from a user’s perspective.
This framework was designed for request-driven systems, especially microservices that handle HTTP, gRPC, or RPC-style traffic. Unlike the Golden Signals (which also include saturation), RED focuses narrowly on what most directly impacts users.
RED metrics give you a quick health snapshot of any service without needing 20 dashboards or 50 PromQL queries. They work because they capture the symptoms that users feel first: slowness, errors, or missing functionality.
Here’s why they’re so valuable:
- If Requests drop suddenly, something upstream broke.
- If Errors spike, your service is failing.
- If Duration increases, users will feel latency before you do.

Instead of treating RED as three isolated numbers, think of them as a story.
Requests tell you how much work your service is doing. Tracking this helps you answer simple but critical questions about how much traffic you are serving and whether it has changed unexpectedly.
A sudden drop in requests usually indicates an upstream routing problem. A sudden spike might mean bots, retries, or cascading failures.
Errors show you how many requests didn’t succeed. Depending on your architecture, this could include:
- HTTP 5xx responses
- Timeouts
- Exceptions
- gRPC/internal errors
Error patterns often reveal much more than static CPU/memory charts ever will.
Duration tells you how long requests take, but average latency is misleading.
Real systems need p95 and p99 latency to understand tail behavior.
Long-tail latency (e.g., p99) is usually the first thing users complain about.
A small DB slowdown, a cache-miss storm, or a slow external service: all of these show up in Duration long before other metrics.
When plotting RED metrics in OpenObserve, the X-axis almost always uses a time histogram. Traces arrive with highly granular timestamps (often microseconds), so plotting them directly creates unreadable charts. Using histogram(_timestamp) groups spans into consistent time buckets such as 10 seconds, 1 minute, or 1 hour, giving a smooth view of how traffic, errors, and latency trend over time.
The size of the bucket adapts automatically based on the dashboard’s time range. A short 30-minute range produces finer buckets, while longer ranges switch to wider buckets to keep charts digestible. This bucketing is crucial for multi-service systems, where thousands of spans arrive every minute.
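As a rough sketch, the SQL behind such a panel might look like the following; the stream name `default` is a placeholder for your traces stream, and the second argument to `histogram()` is optional if you prefer adaptive bucketing.

```sql
-- Bucket spans into fixed 1-minute intervals and count how many land in each.
-- "default" is a placeholder stream name; omit the '1 minute' argument to let
-- OpenObserve pick a bucket size based on the dashboard's time range.
SELECT
  histogram(_timestamp, '1 minute') AS time_bucket,
  COUNT(_timestamp)                 AS span_count
FROM "default"
GROUP BY time_bucket
ORDER BY time_bucket
```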
The Requests panel visualizes how many requests your system processes over time. A line chart works best here because it highlights spikes, surges, and dips clearly. When paired with COUNT(_timestamp) on the Y-axis (which counts the number of records in each bucket), it shows whether load is increasing, decreasing, or behaving abnormally.
Recommended setup
Note: You can filter based on different fields.
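For example, a minimal Requests panel query could look like this; the stream name `default` and the `service_name` filter are assumptions, so adjust them to match your own span attributes.

```sql
-- Requests over time for a single service.
-- X-axis: histogram(_timestamp); Y-axis: COUNT(_timestamp).
-- Stream and field names below are placeholders for your environment.
SELECT
  histogram(_timestamp) AS time_bucket,
  COUNT(_timestamp)     AS request_count
FROM "default"
WHERE service_name = 'checkout'
GROUP BY time_bucket
ORDER BY time_bucket
```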

Errors are discrete events and often cluster in bursts. A line/bar chart makes these bursts immediately visible, especially when time buckets are small.
Recommended setup
http_status_code >= 500
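Translated into a panel query, that filter might look like the sketch below; `http_status_code` is a common span attribute name, but yours may differ depending on instrumentation.

```sql
-- Error volume over time: only spans whose HTTP status is 5xx.
-- "default" and http_status_code are placeholders; match them to your stream.
SELECT
  histogram(_timestamp) AS time_bucket,
  COUNT(_timestamp)     AS error_count
FROM "default"
WHERE http_status_code >= 500
GROUP BY time_bucket
ORDER BY time_bucket
```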
Latency is inherently continuous and best understood as a trend. A line chart emphasizes changes in performance over time and makes it easy to spot gradual degradation or sharp spikes.
For duration data, percentiles such as p95 or p99 are ideal. When plotted over time buckets, these show tail-latency behavior that averages can never capture.
Recommended setup
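A possible Duration panel query is sketched below, assuming span latency is stored in a numeric `duration` field and that your OpenObserve version exposes `approx_percentile_cont` for percentile aggregation; verify both against your deployment.

```sql
-- p95 and p99 latency per time bucket.
-- "default" and duration are placeholder names; latency units depend on your pipeline.
SELECT
  histogram(_timestamp)                  AS time_bucket,
  approx_percentile_cont(duration, 0.95) AS p95_latency,
  approx_percentile_cont(duration, 0.99) AS p99_latency
FROM "default"
GROUP BY time_bucket
ORDER BY time_bucket
```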

| Attribute | RED Metrics | Golden Signals |
| --- | --- | --- |
| Focus | Microservices | Any system |
| Includes Saturation? | No | Yes |
| Best For | API-driven workloads | Infrastructure + Services |
| SLO Mapping | Very direct | Broader |
OpenObserve automatically derives Rate, Error, and Duration metrics from your OpenTelemetry traces and visualizes them at the top of the Traces UI. As soon as spans arrive, OpenObserve computes request throughput, error counts, and latency percentiles without requiring any metric exporters, Prometheus setups, or custom dashboards. This gives you RED insights the moment your tracing pipeline is connected.

You can choose the time range for which you want to see the data and filter on error traces for root-cause analysis.

Additionally, you can filter based on different fields.

RED metrics (Requests, Errors, and Duration) offer a focused, user-centric view of microservice health. By concentrating on the signals that directly affect end-user experience, RED helps teams quickly identify issues, reduce alert noise, and make informed decisions during incidents. When combined with SLO-based alerting and percentile-based latency tracking, RED becomes a reliable foundation for both operational monitoring and performance optimization.
If you want to see RED metrics in action, OpenObserve makes it easy to collect, visualize, and analyze them across your services. From dashboards and endpoint-level breakdowns to burn-rate alerts and trace correlation, OpenObserve provides a unified platform to turn RED metrics into actionable insights.
1. Are RED metrics the same as Golden Signals?
Not exactly. Golden Signals include latency, errors, traffic, and saturation, while RED focuses only on requests, errors, and duration. RED is more specialized for microservices, whereas Golden Signals apply broadly to any system.
2. Should RED metrics use averages or percentiles?
Percentiles such as p95 and p99 are more accurate because they capture tail latency, which represents the worst user experiences. Averages hide spikes and make it harder to detect real performance problems.
3. Can RED metrics be derived from distributed traces?
Yes, especially when using OpenTelemetry. The duration of spans naturally represents latency, status codes indicate success or failure, and the volume of spans per endpoint gives you request counts.
4. Are RED metrics enough by themselves?
They are a strong starting point, but they don’t cover resource saturation, JVM metrics, queue depth, or host-level telemetry. RED should be combined with infra metrics or Golden Signals for complete operational visibility.
5. Why do SRE teams prefer RED during incident response?
RED surfaces the symptoms users feel (errors and slowness) before deeper metrics show anything unusual. It allows responders to quickly isolate problematic endpoints and focus debugging where it matters.