
Observability Dashboards: What to Show and How to Build Them

What you will learn

  1. The dashboards most teams actually need (and the KPIs to plot)
  2. Design patterns that speed up triage and reduce noise
  3. Concrete OpenObserve examples (queries, variables, alerts)

The Importance of Observability Dashboards in Cloud Environments

In modern cloud-native environments, teams run distributed microservices across Kubernetes clusters, serverless functions, and managed infrastructure. While this improves scalability, it also adds complexity.

When something goes wrong, you can’t simply log into a single machine. You need to quickly answer:

  • Is the user journey healthy right now?
  • Which service, deployment, or region is at fault?
  • Do we have enough capacity to handle load spikes?

Observability dashboards are the bridge between raw telemetry and action. They allow engineers to:

  1. Visualize system health in real time (latency, errors, throughput).
  2. Correlate incidents with deployments, versions, or infra changes.
  3. Track ingest quality and costs to keep observability sustainable.

Why Dashboards Matter

Dashboards are often the first stop during an incident. Done well, they help on-call engineers answer “is the system healthy?” in seconds and drill into root causes without drowning in noise. Done poorly, they become cluttered wall art nobody trusts.

This guide shows what dashboards actually matter, which KPIs to plot, design patterns that reduce MTTR, and how to build them step-by-step in OpenObserve using real SQL queries.

Observability dashboards that matter (with KPIs to plot)

| Dashboard | Primary audience | What it answers | KPIs / panels to include |
|---|---|---|---|
| Service health (golden signals) | SRE, on-call | “Is the user journey healthy right now?” | Request rate, error rate, p95/p99 latency, saturation (CPU/mem), % good events |
| Incident triage | SRE, DevOps | “What changed and where?” | Recent deploys, error bursts by service/version, top log patterns, hot endpoints, trace exemplars |
| Backend & infra | Platform | “Do we have capacity and headroom?” | Node/pod health, throttling, restarts, disk/IO, queue depth, GC, network egress |
| Ingest quality | Observability owners | “Is telemetry complete and parseable?” | Ingest GB/day, dropped events, parsing errors, pipeline cost, stream cardinality |
| Cost & retention | Eng mgmt, FinOps | “What’s our spend driver?” | Ingest by team/namespace, query GB, top expensive queries, retention by stream |

Start with service health and incident triage. Add ingest and cost dashboards once you’ve run for a week and patterns start to emerge.

Design patterns that work

  1. One page = one decision. If a page can’t answer a single on-call question quickly, split it.
  2. Progressive disclosure. Lead with high-level SLO charts; drill down to per-service, per-endpoint, per-pod details.
  3. Time-boxed queries. Default to last 15–60 min; add quick links for 6h/24h when hunting regressions.
  4. Consistent panels. Use the same axes, units, and color order across dashboards; muscle memory cuts MTTR.
  5. Link everything. Every KPI should have a click-through to logs and traces for the same time window + filters.
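
To make pattern 5 concrete, a drill-down target can simply be a logs view scoped to the same variables and time window as the originating panel. A minimal sketch, assuming the default stream and a level field (both hypothetical here, matching the examples later in this guide):

SELECT *
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND level = 'error'
ORDER BY _timestamp DESC
LIMIT 100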

How to Create Dashboards in OpenObserve

If you’re new to dashboards in OpenObserve, start with the official guide on creating dashboards. It walks through the basics of building panels, setting time ranges, and customizing layouts.

Once you’ve covered the basics, the next section dives into practical patterns, SQL queries, and variables you can use to build production-ready dashboards for SRE, DevOps, and platform teams.

Designing Effective Observability Dashboards

This section assumes you’re already sending logs, metrics, and traces to OpenObserve. If not, dual-write from Fluent Bit or the OTel Collector for a week to validate parity, then cut over.

  1. Define Variables and Filters

Variables let dashboards be reusable across teams, services, and environments. Typical examples:

  • Service: list of services in your system
  • Environment: prod / staging / dev
  • Namespace or Team: group by deployment or owner

These variables power the drop-downs so the same dashboard can serve multiple teams.
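
Dashboard variables are typically populated from the distinct values of a field on one of your streams. Conceptually, the Service drop-down boils down to a query like the sketch below (assuming your logs carry a service_name field, as in the examples that follow):

SELECT DISTINCT service_name
FROM default
ORDER BY service_name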

Using Environment Filters in OpenObserve Dashboards

  2. Core Metrics and Queries

Below are starter queries you can copy, adapt, and turn into panels. Each example answers a real on-call question.

1. Service Health (Golden Signals)

  • Error rate (%):
SELECT histogram(_timestamp, '5m') AS ts,
       countIf(status_code >= 500) * 100.0 / count(*) AS error_rate
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND event_type = 'http_request'
GROUP BY ts
ORDER BY ts
  • Latency (p95):
SELECT histogram(_timestamp, '5m') AS ts,
       approx_percentile(root_duration_ms, 0.95) AS p95_latency
FROM traces
WHERE service_name = '${Service}' AND environment = '${Environment}'
GROUP BY ts
ORDER BY ts
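  • % good events (from the table above; a minimal sketch reusing the same hypothetical http_request fields as the error-rate query):
SELECT histogram(_timestamp, '5m') AS ts,
       countIf(status_code < 500) * 100.0 / count(*) AS good_events_pct
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND event_type = 'http_request'
GROUP BY ts
ORDER BY ts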

2. Incident Triage

  • Errors by service/version:
SELECT histogram(_timestamp, '5m') AS ts,
       service_name,
       version,
       countIf(status_code >= 500) AS error_count
FROM logs
WHERE environment = '${Environment}'
GROUP BY ts, service_name, version
ORDER BY ts
  • Top failing endpoints:
SELECT endpoint, count(*) AS failures
FROM traces
WHERE status_code >= 500
  AND environment = '${Environment}'
  AND span.kind = 'server'
GROUP BY endpoint
ORDER BY failures DESC
LIMIT 10
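  • Recent deploys (also listed in the triage table; a sketch assuming you emit deployment events into a stream, hypothetically named deploy_events, with service_name and version fields):
SELECT _timestamp AS deployed_at, service_name, version
FROM deploy_events
WHERE environment = '${Environment}'
ORDER BY _timestamp DESC
LIMIT 20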

3. Backend & Infra

  • Pod restarts:
SELECT histogram(_timestamp, '5m') AS ts,
       pod_name,
       max(restart_count) AS restarts
FROM kube_pod_metrics
WHERE namespace = '${Namespace}'
GROUP BY ts, pod_name
ORDER BY ts
  • CPU & Memory saturation:
SELECT histogram(_timestamp, '5m') AS ts,
       avg(cpu_usage) AS cpu,
       avg(memory_usage) AS mem
FROM kube_node_metrics
WHERE cluster = '${Cluster}'
GROUP BY ts
ORDER BY ts
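  • CPU throttling (a sketch; assumes kube_pod_metrics exposes a cpu_throttled_seconds counter — adjust the field to whatever your collector actually ships):
SELECT histogram(_timestamp, '5m') AS ts,
       pod_name,
       sum(cpu_throttled_seconds) AS throttled_seconds
FROM kube_pod_metrics
WHERE namespace = '${Namespace}'
GROUP BY ts, pod_name
ORDER BY ts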

4. Ingest Quality

  • Dropped or parsing errors:
SELECT histogram(_timestamp, '5m') AS ts,
       countIf(parse_error = true) AS parsing_errors
FROM ingest_logs
WHERE environment = '${Environment}'
GROUP BY ts
ORDER BY ts
  • Stream cardinality (distinct values):
SELECT count(DISTINCT user_id) AS unique_users
FROM logs
-- time range (e.g., last 1 hour) comes from the dashboard's time picker
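  • Ingest gaps (events per interval for a production stream; a minimal sketch that pairs with the “silent stream” alert described later):
SELECT histogram(_timestamp, '5m') AS ts,
       count(*) AS events
FROM default
WHERE environment = '${Environment}'
GROUP BY ts
ORDER BY ts
-- a sustained drop to zero means the stream has gone silent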

5. Cost & Retention

  • Ingest by team/namespace:
SELECT namespace,
       round(sum(bytes_ingested)/1024/1024/1024, 2) AS gb_ingested
FROM ingest_logs 
GROUP BY namespace
ORDER BY gb_ingested DESC
-- time range (e.g., last 24 hours) comes from the dashboard's time picker
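  • Ingest GB/day trend for retention planning (a sketch reusing the bytes_ingested field above; assumes histogram() accepts a daily bucket):
SELECT histogram(_timestamp, '1d') AS day,
       round(sum(bytes_ingested)/1024/1024/1024, 2) AS gb_ingested
FROM ingest_logs
GROUP BY day
ORDER BY day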

Each query becomes a dashboard panel. Use variables (${Service}, ${Environment}, ${Namespace}) to make them reusable across teams.

  3. Panels to add immediately

  • SLO summary: uptime %, availability burn rate, p95 latency line.

Latency Panel in the OpenObserve UI

  • Errors by dimension: stacked area by service, version, k8s.namespace.
  • Hot endpoints: top 10 routes by latency and by error count.
  • Trace exemplars: sample of slow/error traces for the selected window.
  • Capacity: CPU, memory, and container restarts for the selected service.

CPU Utilization Panel in the OpenObserve UI

  • Ingest & cost (owners only): ingest GB/day, query GB/day, pipeline GB.

  4. Alerts that are actually useful

Most teams drown in noisy alerts. The ones that matter are the ones tied directly to user impact or to the reliability of your observability pipeline. Start with these:

  • SLO burn rate (user-impacting):
    • Page on-call when the short-window burn rate (e.g., 5 minutes against a 2-hour window) exceeds your fast-burn threshold.

    • Open a ticket when the 1-hour burn rate stays above your slow-burn threshold.

    Learn how to [set up SLO-based alerts in OpenObserve](https://openobserve.ai/blog/slo-based-alerting/).

SLO-based alert example

  • Error bursts (early warning): Trigger if the error rate suddenly spikes above normal for 5+ minutes. This helps catch issues before they turn into outages.
  • Ingest gaps (observability health): Alert if a production stream goes silent for 5 minutes. No data = flying blind during incidents.
  • Pipeline failures (data quality): Fire when parsing error rate crosses a defined threshold, so you catch broken log/metric pipelines before dashboards go dark.
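
For the error-burst alert, the condition can be driven by a query much like the error-rate panel. A minimal sketch, reusing the same hypothetical fields; the threshold and the “for 5+ minutes” evaluation window live in the alert configuration, not in the SQL:

SELECT countIf(status_code >= 500) * 100.0 / count(*) AS error_rate
FROM default
WHERE service_name = 'checkout' AND environment = 'prod' AND event_type = 'http_request'
-- 'checkout' and 'prod' are placeholders; point the alert at the concrete service and environment you care about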

  5. Link-outs for fast triage

  • From error rate → logs filtered to level:error and the time window.
  • From latency panel → traces > p95 for the selected endpoint.
  • From deploy markers → the CI job link (store the URL in a log field).

Common mistakes (and quick fixes)

  • Everything on one page. Fix: split into health vs triage; keep < 12 panels per page.

Splitting Tabs in the OpenObserve Dashboard UI

  • Unbounded queries. Fix: default to 15–60 min; add presets; never ship “last 30 days” panels.

Avoiding Unbounded Queries in the OpenObserve Dashboard UI

  • Lack of standards. Fix: normalize labels/fields (env, service, namespace); enforce at ingest.

Enforcing standard labels/fields at ingest

  • Alert noise. Fix: use burn-rate + change-aware thresholds; review weekly.

Implementation checklist (OpenObserve)

  1. Variables: env, service, namespace.
  2. Data hygiene: ensure service, status, duration_ms, trace_id exist and are typed correctly.
  3. Dashboards: create Service Health and Incident Triage first; add Ingest Quality later.
  4. Drill-downs: add link-outs to logs/traces from every KPI.
  5. Alerts: start with burn-rate + error bursts; add ingest/pipeline alerts for owners.
  6. Review: 15 minutes each Friday, prune panels, re-baseline, retire unused charts.

Where to start?

If you’re setting this up today, begin small: stream one service into OpenObserve, create a Service Health dashboard, and add an Incident Triage view. Build out panels, wire up variables and filters, and add custom charts where the standard panel types fall short.

Once those are solid, expand into ingest quality and cost. That foundation will keep your observability useful instead of noisy wall art.

Next, you can tie dashboards to actionable burn-rate alerts and user-impact metrics. Read our SLO alerting guide.

Ready to put these principles into practice?
Sign up for an OpenObserve cloud account (14-day free trial) or visit our downloads page to self-host OpenObserve.

FAQs

1. What KPIs should I include on an SRE dashboard?

Most SRE dashboards focus on golden signals: request rate, error rate, latency (p95/p99), and saturation (CPU, memory, queue depth). For incidents, add deployment markers, top log patterns, and trace exemplars. Start with service health and incident triage panels, then layer on cost and ingest quality metrics.

2. How do I avoid noisy dashboards and alerts?

  • Limit panels per dashboard (<12 per page) and split by decision type.
  • Use time-boxed queries (15–60 min default) and pre-aggregate long-term data.
  • Tie alerts to user impact or pipeline health rather than raw thresholds.
  • Enforce consistent labels and ownership so every panel has a responsible team.

3. How can I make dashboards reusable across services and environments?

Use variables and filters like service, environment, and namespace in your queries. This enables one dashboard template to work for multiple teams or deployments. Drop-downs or filter widgets in OpenObserve allow easy switching between contexts.

4. How do I track observability pipeline health?

Monitor ingest quality metrics, such as:

  • Dropped or parsing errors per stream
  • Stream cardinality and throughput
  • Pipeline cost (GB/day)

Alerts on silent streams or elevated parsing errors help catch broken telemetry before dashboards go dark.

5. How do I calculate SLO burn-rate for dashboards and alerts?

Burn rate measures how fast you’re consuming your error budget:

burn_rate = observed_bad_fraction / error_budget

Where observed_bad_fraction = bad_count / total_count in the alert window, and error_budget = 1 - SLO target (for example, 0.001 for a 99.9% SLO). A burn rate of 1 means you are consuming the budget at exactly the rate that exhausts it by the end of the SLO window; a multiplier above 1 exhausts it proportionally faster. Use this to trigger on-call paging for fast burns and tickets for slower trends. Include SLO targets and time windows in dashboards for clarity.
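
A minimal sketch of a burn-rate panel, reusing the hypothetical http_request fields from the examples above and assuming a 99.9% availability SLO (error budget of 0.001):

SELECT histogram(_timestamp, '5m') AS ts,
       (countIf(status_code >= 500) * 1.0 / count(*)) / 0.001 AS burn_rate
FROM default
WHERE service_name = '${Service}' AND environment = '${Environment}' AND event_type = 'http_request'
GROUP BY ts
ORDER BY ts
-- 0.001 is the error budget for a 99.9% SLO; burn_rate > 1 means the budget is being spent faster than planned

For instance, with a 99.9% SLO the budget is 0.1%, so a sustained 1% error rate is a burn rate of 10, which would exhaust a 30-day budget in roughly three days.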

6. What are best practices for labeling metrics and logs?

  • Standardize labels across services (env, service, namespace).
  • Avoid high-cardinality tags (e.g., per-user IDs) in aggregated dashboards.
  • Normalize field names at ingest and validate in CI/CD pipelines to ensure consistent dashboards and queries.

7. How often should dashboards be reviewed?

Weekly or bi-weekly reviews help prune unused panels, recalibrate alert thresholds, and keep dashboards aligned with evolving production systems.

About the Author

Simran Kumari

Passionate about observability, AI systems, and cloud-native tools. All in on DevOps and improving the developer experience.
