
Observability Dashboards: What to Show and How to Build Them

What you will learn

  1. The dashboards most teams actually need (and the KPIs to plot)
  2. Design patterns that speed up triage and reduce noise
  3. Concrete OpenObserve examples (queries, variables, alerts)

The Importance of Observability Dashboards in Cloud Environments

In modern cloud-native environments, teams run distributed microservices across Kubernetes clusters, serverless functions, and managed infrastructure. While this improves scalability, it also adds complexity.

When something goes wrong, you can’t simply log into a single machine. You need to quickly answer:

  • Is the user journey healthy right now?
  • Which service, deployment, or region is at fault?
  • Do we have enough capacity to handle load spikes?

Observability dashboards are the bridge between raw telemetry and action. They allow engineers to:

  1. Visualize system health in real time (latency, errors, throughput).
  2. Correlate incidents with deployments, versions, or infra changes.
  3. Track ingest quality and costs to keep observability sustainable.

Why Dashboards Matter

Dashboards are often the first stop during an incident. Done well, they help on-call engineers answer “is the system healthy?” in seconds and drill into root causes without drowning in noise. Done poorly, they become cluttered wall art nobody trusts.

This guide shows what dashboards actually matter, which KPIs to plot, design patterns that reduce MTTR, and how to build them step-by-step in OpenObserve using real SQL queries.

Observability dashboards that matter (with KPIs to plot)

| Dashboard | Primary audience | What it answers | KPIs / panels to include |
|---|---|---|---|
| Service health (golden signals) | SRE, on-call | “Is the user journey healthy right now?” | Request rate, error rate, p95/p99 latency, saturation (CPU/mem), % good events |
| Incident triage | SRE, DevOps | “What changed and where?” | Recent deploys, error bursts by service/version, top log patterns, hot endpoints, trace exemplars |
| Backend & infra | Platform | “Do we have capacity and headroom?” | Node/pod health, throttling, restarts, disk/IO, queue depth, GC, network egress |
| Ingest quality | Observability owners | “Is telemetry complete and parseable?” | Ingest GB/day, dropped events, parsing errors, pipeline cost, stream cardinality |
| Cost & retention | Eng mgmt, FinOps | “What’s our spend driver?” | Ingest by team/namespace, query GB, top expensive queries, retention by stream |

Start with service health and incident triage. Add ingest and cost dashboards once you’ve run for a week and patterns start to emerge.

Design patterns that work

  1. One page = one decision. If a page can’t answer a single on-call question quickly, split it.
  2. Progressive disclosure. Lead with high-level SLO charts; drill down to per-service, per-endpoint, per-pod details.
  3. Time-boxed queries. Default to last 15–60 min; add quick links for 6h/24h when hunting regressions.
  4. Consistent panels. Use the same axes, units, and color order across dashboards; muscle memory cuts MTTR.
  5. Link everything. Every KPI should have a click-through to logs and traces for the same time window + filters.
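
To make pattern 5 concrete, a drill-down target can simply be a logs view scoped to the same variables and time window as the originating panel. A minimal sketch, assuming the default stream and a level field (both hypothetical here, matching the examples later in this guide):

SELECT *
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND level = 'error'
ORDER BY _timestamp DESC
LIMIT 100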

How to Create Dashboards in OpenObserve

If you’re new to dashboards in OpenObserve, start with the official guide on creating dashboards. It walks through the basics of building panels, setting time ranges, and customizing layouts.

Once you’ve covered the basics, the next section dives into practical patterns, SQL queries, and variables you can use to build production-ready dashboards for SRE, DevOps, and platform teams.

Designing Effective Observability Dashboards

This section assumes you’re already sending logs, metrics, and traces to OpenObserve. If not, dual-write from Fluent Bit or the OTel Collector for a week to validate parity, then cut over.

  1. Define Variables and Filters

Variables let dashboards be reusable across teams, services, and environments. Typical examples:

  • Service: list of services in your system
  • Environment: prod / staging / dev
  • Namespace or Team: group by deployment or owner

These variables power the drop-downs so the same dashboard can serve multiple teams.
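
Dashboard variables are typically populated from the distinct values of a field on one of your streams. Conceptually, the Service drop-down boils down to a query like the sketch below (assuming your logs carry a service_name field, as in the examples that follow):

SELECT DISTINCT service_name
FROM default
ORDER BY service_name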

Using Environment Filters in OpenObserve Dashboards

  2. Core Metrics and Queries

Below are starter queries you can copy, adapt, and turn into panels. Each example answers a real on-call question.

1. Service Health (Golden Signals)

  • Error rate (%):
SELECT histogram(_timestamp, '5m') AS ts,
       countIf(status_code >= 500) * 100.0 / count(*) AS error_rate
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND event_type = 'http_request'
GROUP BY ts
ORDER BY ts
  • Latency (p95):
SELECT histogram(_timestamp, '5m') AS ts,
       approx_percentile(root_duration_ms, 0.95) AS p95_latency
FROM traces
WHERE service_name = '${Service}' AND environment = '${Environment}'
GROUP BY ts
ORDER BY ts
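  • % good events (from the table above; a minimal sketch reusing the same hypothetical http_request fields as the error-rate query):
SELECT histogram(_timestamp, '5m') AS ts,
       countIf(status_code < 500) * 100.0 / count(*) AS good_events_pct
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND event_type = 'http_request'
GROUP BY ts
ORDER BY ts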

2. Incident Triage

  • Errors by service/version:
SELECT histogram(_timestamp, '5m') AS ts,
       service_name,
       version,
       countIf(status_code >= 500) AS error_count
FROM logs
WHERE environment = '${Environment}'
GROUP BY ts, service_name, version
ORDER BY ts
  • Top failing endpoints:
SELECT endpoint, count(*) AS failures
FROM traces
WHERE status_code >= 500
  AND environment = '${Environment}'
  AND span.kind = 'server'
GROUP BY endpoint
ORDER BY failures DESC
LIMIT 10
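  • Recent deploys (also listed in the triage table; a sketch assuming you emit deployment events into a stream, hypothetically named deploy_events, with service_name and version fields):
SELECT _timestamp AS deployed_at, service_name, version
FROM deploy_events
WHERE environment = '${Environment}'
ORDER BY _timestamp DESC
LIMIT 20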

3. Backend & Infra

  • Pod restarts:
SELECT histogram(_timestamp, '5m') AS ts,
       pod_name,
       max(restart_count) AS restarts
FROM kube_pod_metrics
WHERE namespace = '${Namespace}'
GROUP BY ts, pod_name
ORDER BY ts
  • CPU & Memory saturation:
SELECT histogram(_timestamp, '5m') AS ts,
       avg(cpu_usage) AS cpu,
       avg(memory_usage) AS mem
FROM kube_node_metrics
WHERE cluster = '${Cluster}'
GROUP BY ts
ORDER BY ts
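  • CPU throttling (a sketch; assumes kube_pod_metrics exposes a cpu_throttled_seconds counter — adjust the field to whatever your collector actually ships):
SELECT histogram(_timestamp, '5m') AS ts,
       pod_name,
       sum(cpu_throttled_seconds) AS throttled_seconds
FROM kube_pod_metrics
WHERE namespace = '${Namespace}'
GROUP BY ts, pod_name
ORDER BY ts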

4. Ingest Quality

  • Dropped or parsing errors:
SELECT histogram(_timestamp, '5m') AS ts,
       countIf(parse_error = true) AS parsing_errors
FROM ingest_logs
WHERE environment = '${Environment}'
GROUP BY ts
ORDER BY ts
  • Stream cardinality (distinct values):
SELECT count(DISTINCT user_id) AS unique_users
FROM logs
-- time range (e.g., last 1 hour) comes from the dashboard's time picker
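  • Ingest gaps (events per interval for a production stream; a minimal sketch that pairs with the “silent stream” alert described later):
SELECT histogram(_timestamp, '5m') AS ts,
       count(*) AS events
FROM default
WHERE environment = '${Environment}'
GROUP BY ts
ORDER BY ts
-- a sustained drop to zero means the stream has gone silent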

5. Cost & Retention

  • Ingest by team/namespace:
SELECT namespace,
       round(sum(bytes_ingested)/1024/1024/1024, 2) AS gb_ingested
FROM ingest_logs 
GROUP BY namespace
ORDER BY gb_ingested DESC
-- time range (e.g., last 24 hours) comes from the dashboard's time picker
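  • Ingest GB/day trend for retention planning (a sketch reusing the bytes_ingested field above; assumes histogram() accepts a daily bucket):
SELECT histogram(_timestamp, '1d') AS day,
       round(sum(bytes_ingested)/1024/1024/1024, 2) AS gb_ingested
FROM ingest_logs
GROUP BY day
ORDER BY day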

Each query becomes a dashboard panel. Use variables (${Service}, ${Environment}, ${Namespace}) to make them reusable across teams.

  3. Panels to add immediately

  • SLO summary: uptime %, availability burn rate, p95 latency line.

Latency Panel in the OpenObserve UI

  • Errors by dimension: stacked area by service, version, k8s.namespace.
  • Hot endpoints: top 10 routes by latency and by error count.
  • Trace exemplars: sample of slow/error traces for the selected window.
  • Capacity: CPU, memory, and container restarts for the selected service.

CPU Utilization Panel in the OpenObserve UI

  • Ingest & cost (owners only): ingest GB/day, query GB/day, pipeline GB.

  4. Alerts that are actually useful

Most teams drown in noisy alerts. The ones that matter are the ones tied directly to user impact or to the reliability of your observability pipeline. Start with these:

  • SLO burn rate (user-impacting):
    • Page on-call when the short-window burn rate (e.g., 5 minutes against a 2-hour window) exceeds your fast-burn threshold.

    • Open a ticket when the 1-hour burn rate stays above your slow-burn threshold.

    Learn how to [set up SLO-based alerts in OpenObserve](https://openobserve.ai/blog/slo-based-alerting/).

SLO-based alert example

  • Error bursts (early warning): Trigger if the error rate suddenly spikes above normal for 5+ minutes. This helps catch issues before they turn into outages.
  • Ingest gaps (observability health): Alert if a production stream goes silent for 5 minutes. No data = flying blind during incidents.
  • Pipeline failures (data quality): Fire when parsing error rate crosses a defined threshold, so you catch broken log/metric pipelines before dashboards go dark.
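
For the error-burst alert, the condition can be driven by a query much like the error-rate panel. A minimal sketch, reusing the same hypothetical fields; the threshold and the “for 5+ minutes” evaluation window live in the alert configuration, not in the SQL:

SELECT countIf(status_code >= 500) * 100.0 / count(*) AS error_rate
FROM default
WHERE service_name = 'checkout' AND environment = 'prod' AND event_type = 'http_request'
-- 'checkout' and 'prod' are placeholders; point the alert at the concrete service and environment you care about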

  5. Link-outs for fast triage

  • From error rate → logs filtered to level:error and the time window.
  • From latency panel → traces > p95 for the selected endpoint.
  • From deploy markers → the CI job link (store the URL in a log field).

Common mistakes (and quick fixes)

  • Everything on one page. Fix: split into health vs triage; keep < 12 panels per page.

Splitting Tabs in the OpenObserve Dashboard UI

  • Unbounded queries. Fix: default to 15–60 min; add presets; never ship “last 30 days” panels.

Avoiding Unbounded Queries in the OpenObserve Dashboard UI

  • Lack of standards. Fix: normalize labels/fields (env, service, namespace); enforce at ingest.

Enforcing standard labels/fields at ingest

  • Alert noise. Fix: use burn-rate + change-aware thresholds; review weekly.

Implementation checklist (OpenObserve)

  1. Variables: env, service, namespace.
  2. Data hygiene: ensure service, status, duration_ms, trace_id exist and are typed correctly.
  3. Dashboards: create Service Health and Incident Triage first; add Ingest Quality later.
  4. Drill-downs: add link-outs to logs/traces from every KPI.
  5. Alerts: start with burn-rate + error bursts; add ingest/pipeline alerts for owners.
  6. Review: 15 minutes each Friday, prune panels, re-baseline, retire unused charts.

Where to start?

If you’re setting this up today, begin small: stream one service into OpenObserve, create a Service Health dashboard, and add an Incident Triage view. Build out panels, wire up variables and filters, and add custom charts where the standard panel types fall short.

Once those are solid, expand into ingest quality and cost. That foundation will keep your observability useful instead of noisy wall art.

Next, you can tie dashboards to actionable burn-rate alerts and user-impact metrics. Read our SLO alerting guide.

Ready to put these principles into practice?
Sign up for an OpenObserve cloud account (14-day free trial) or visit our downloads page to self-host OpenObserve.

FAQs

1. What KPIs should I include on an SRE dashboard?

Most SRE dashboards focus on golden signals: request rate, error rate, latency (p95/p99), and saturation (CPU, memory, queue depth). For incidents, add deployment markers, top log patterns, and trace exemplars. Start with service health and incident triage panels, then layer on cost and ingest quality metrics.

2. How do I avoid noisy dashboards and alerts?

  • Limit panels per dashboard (<12 per page) and split by decision type.
  • Use time-boxed queries (15–60 min default) and pre-aggregate long-term data.
  • Tie alerts to user impact or pipeline health rather than raw thresholds.
  • Enforce consistent labels and ownership so every panel has a responsible team.

3. How can I make dashboards reusable across services and environments?

Use variables and filters like service, environment, and namespace in your queries. This enables one dashboard template to work for multiple teams or deployments. Drop-downs or filter widgets in OpenObserve allow easy switching between contexts.

4. How do I track observability pipeline health?

Monitor ingest quality metrics, such as:

  • Dropped or parsing errors per stream
  • Stream cardinality and throughput
  • Pipeline cost (GB/day)

Alerts on silent streams or elevated parsing errors help catch broken telemetry before dashboards go dark.

5. How do I calculate SLO burn-rate for dashboards and alerts?

Burn rate measures how fast you’re consuming your error budget:

burn_rate = observed_bad_fraction / error_budget

Where observed_bad_fraction = bad_count / total_count in the alert window, and error_budget = 1 - SLO target (for example, 0.001 for a 99.9% SLO). A burn rate of 1 means you are consuming the budget at exactly the rate that exhausts it by the end of the SLO window; a multiplier above 1 exhausts it proportionally faster. Use this to trigger on-call paging for fast burns and tickets for slower trends. Include SLO targets and time windows in dashboards for clarity.
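
A minimal sketch of a burn-rate panel, reusing the hypothetical http_request fields from the examples above and assuming a 99.9% availability SLO (error budget of 0.001):

SELECT histogram(_timestamp, '5m') AS ts,
       (countIf(status_code >= 500) * 1.0 / count(*)) / 0.001 AS burn_rate
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND event_type = 'http_request'
GROUP BY ts
ORDER BY ts
-- 0.001 is the error budget for a 99.9% SLO; burn_rate > 1 means the budget is being spent faster than planned

For instance, with a 99.9% SLO the budget is 0.1%, so a sustained 1% error rate is a burn rate of 10, which would exhaust a 30-day budget in roughly three days.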

6. What are best practices for labeling metrics and logs?

  • Standardize labels across services (env, service, namespace).
  • Avoid high-cardinality tags (e.g., per-user IDs) in aggregated dashboards.
  • Normalize field names at ingest and validate in CI/CD pipelines to ensure consistent dashboards and queries.

7. How often should dashboards be reviewed?

Weekly or bi-weekly reviews help prune unused panels, recalibrate alert thresholds, and keep dashboards aligned with evolving production systems.

About the Author

Simran Kumari

Passionate about observability, AI systems, and cloud-native tools. All in on DevOps and improving the developer experience.
