How to Build Metrics Dashboards for SRE & DevOps Teams

When you’re on-call as an SRE or running ops for your team, dashboards are your cockpit. They’re where you go first when someone pings “is something wrong?”

We’ll cover seven types of dashboards:

  1. Docker Metrics Dashboard
  2. Jenkins Metrics Dashboard
  3. Kubernetes Metrics Dashboard
  4. GitHub Metrics Dashboard
  5. Argo CD Metrics Dashboard
  6. Prometheus Metrics Dashboard
  7. Host Metrics Dashboard

The right dashboard answers the right questions quickly. Is the build pipeline healthy? Did that deployment succeed? Are containers and hosts running out of resources? Is the cluster stable?

Let’s walk through the most common systems you’ll need to monitor. For each dashboard type, we’ll cover:

  1. Why does the dashboard matter?
  2. What should the dashboard help you answer?
  3. What key metrics should you focus on?

Note: A link to a prebuilt JSON dashboard that you can import right away is included for each of the seven dashboards covered.

If you’re new to building dashboards, start with our Observability Dashboards primer.

Docker Metrics Dashboard

When a service slows down or fails, one of the first places to check is the containers running it. Containers form the backbone of modern deployments, and if they are starved for CPU, memory, or I/O, the issues quickly ripple through your applications. You might see containers restarting, requests timing out, or services becoming sluggish: symptoms that often look like application bugs but are actually resource-related.

A Docker dashboard helps you identify these issues immediately. It should give you a clear view of which containers are healthy, which are under stress, and where bottlenecks may be forming.

You should focus on the metrics that reveal resource pressure: CPU usage, memory usage, disk I/O, and network throughput. These metrics help answer questions such as:

  • Which containers are consuming the most CPU or memory right now?
  • Are any containers hitting disk or network limits?
  • Is resource usage balanced across containers?

Recommended Docker dashboard panels

  1. CPU Utilization by Container (line graph): Shows CPU usage % for each container to spot high consumers.

  2. CPU Time Consumed (rate): Tracks the rate at which CPU nanoseconds are consumed, which is useful for gauging workload intensity over time.

  3. Memory Usage % (line graph): Highlights containers under memory pressure.

  4. Memory Usage without Cache: Helps detect memory leaks or excessive growth.

  5. Network I/O – Received / Transmitted (line graphs): Spot containers saturating inbound or outbound network traffic.

  6. Disk I/O (read/write bytes per container): Identify containers causing storage bottlenecks.

By combining these panels, you can monitor container health in real time, detect potential problems before they impact users, and make informed decisions on resource allocation.
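
To make the arithmetic behind the CPU and memory panels concrete, here is a minimal Python sketch. It assumes you already have two samples of a container's cumulative CPU time in nanoseconds, plus a memory reading that separates out the cache. The function and parameter names are hypothetical and are not tied to any particular collector or to the prebuilt dashboard's queries.

```python
def cpu_percent(cpu_ns_start: int, cpu_ns_end: int,
                interval_seconds: float, cpu_count: int = 1) -> float:
    """CPU utilization over an interval, as a percentage of available capacity.

    Inputs are cumulative CPU-time counters in nanoseconds (assumed format).
    """
    if interval_seconds <= 0 or cpu_count <= 0:
        raise ValueError("interval_seconds and cpu_count must be positive")
    # A negative delta means the counter reset (for example, the container
    # restarted), so treat it as zero rather than reporting a negative rate.
    delta_ns = max(cpu_ns_end - cpu_ns_start, 0)
    used_seconds = delta_ns / 1e9
    return 100.0 * used_seconds / (interval_seconds * cpu_count)


def memory_without_cache_percent(usage_bytes: int, cache_bytes: int,
                                 limit_bytes: int) -> float:
    """Memory usage excluding cache, as a percentage of the container limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    working_bytes = max(usage_bytes - cache_bytes, 0)
    return 100.0 * working_bytes / limit_bytes
```

Excluding the cache matters because cached pages can be reclaimed under pressure, so a steadily growing non-cache figure is a stronger hint of a leak than total usage alone.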

Prebuilt Docker Metrics Dashboard

Prebuilt Docker Dashboard JSON

Jenkins Metrics Dashboard

CI/CD pipelines are the engines of modern software delivery. When builds fail or slow down, it impacts deployments, releases, and ultimately your users. You need a dashboard that tells you at a glance which jobs are healthy, which are failing, and where the bottlenecks are in your build pipeline.

A well-built Jenkins dashboard helps you answer questions like:

  1. How long are my pipelines taking to run, and are they slowing down?
  2. How many pipeline jobs are currently active, launched, or completed?
  3. Do I have enough executors available, or are they becoming a bottleneck?
  4. Are executors sitting idle, or is Jenkins fully utilized?

Recommended Jenkins dashboard panels

  1. Pipeline Run Duration (line chart) → Spot trends in how long pipelines take to finish. Useful for detecting gradual slowdowns or regressions after deployments.
  2. Active Pipeline Jobs (gauge) → See at a glance how many jobs are actively running right now.
  3. Pipeline Jobs Launched / Started (single stat) → Track how many jobs Jenkins is processing over time, giving you volume insights.
  4. Available Executors (gauge) → Know how many executors are ready to pick up jobs.
  5. Busy Executors (gauge) → Detect executor saturation when all are busy and jobs may get stuck in the queue.
  6. Idle Executors (gauge) → Ensure you aren’t over-provisioning resources that sit unused.
  7. Online Executors (gauge) → Validate that your Jenkins agents are healthy and connected.

By keeping these panels visible, you can react quickly to failing jobs, balance executor resources, and prevent slowdowns in your CI/CD process.
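
The executor gauges above can be combined into two derived signals: utilization and saturation. The sketch below is one way to express that in Python. The `ExecutorSnapshot` type and its fields are hypothetical, and the queue length is an assumed extra input that is not one of the listed panels.

```python
from dataclasses import dataclass


@dataclass
class ExecutorSnapshot:
    """Point-in-time executor counts (hypothetical shape, not a Jenkins API)."""
    online: int
    busy: int
    idle: int
    queued_jobs: int = 0  # assumed input: jobs waiting for an executor


def utilization_percent(snapshot: ExecutorSnapshot) -> float:
    """Share of online executors that are currently busy."""
    if snapshot.online == 0:
        return 0.0
    return 100.0 * snapshot.busy / snapshot.online


def is_saturated(snapshot: ExecutorSnapshot) -> bool:
    """True when every online executor is busy and jobs are waiting."""
    return (snapshot.online > 0
            and snapshot.idle == 0
            and snapshot.queued_jobs > 0)
```

High utilization on its own is not a problem. The case worth flagging is when no executor is idle while jobs are still waiting in the queue.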

Prebuilt Jenkins Metrics Dashboard Executor Overview 1 of 2

Prebuilt Jenkins Metrics Dashboard Executor Overview 2 of 2

Prebuilt Jenkins Dashboard JSON

Kubernetes Metrics Dashboard

When managing a Kubernetes cluster, ensuring the health of nodes, namespaces, and pods is critical. Kubernetes orchestrates containerized workloads, and if resources such as CPU, memory, or storage become constrained, the impact quickly cascades across applications and services. Symptoms like pod restarts, slow response times, or failed deployments are often caused by underlying resource issues.

A Kubernetes dashboard helps you visualize these problems in real time. It should give you visibility across cluster layers, from nodes and namespaces down to individual pods and events, so you can detect bottlenecks early and keep workloads stable.

You should focus on metrics that reflect cluster health and workload stability, such as CPU and memory usage, pod restarts, scheduling failures, and node pressure conditions. These metrics help answer questions like:

  • Are any nodes under CPU or memory pressure?
  • Which pods or namespaces are consuming the most resources?
  • Are there recurring restarts or failures indicating instability?
  • Have there been recent events pointing to scheduling or resource issues?

Recommended Kubernetes dashboard panels

  1. CPU & Memory Usage (per Node / per Namespace): Spot resource hotspots and imbalances.
  2. Pod Resource Usage: Track container-level CPU and memory to detect leaks or runaway processes.
  3. Node Pressure Indicators: Monitor when nodes report memory, disk, or PID pressure conditions.
  4. Network & Disk I/O: Identify workloads causing storage or network bottlenecks.
  5. Pod Restart Counts: Catch crash loops before they affect application stability.
  6. Kubernetes Events: Surface warnings and errors to troubleshoot scheduling or deployment failures.

By combining these panels, you can monitor Kubernetes health across all layers, detect issues early, and ensure that workloads are running smoothly and efficiently.
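
As an illustration of the Pod Restart Counts panel, the following Python sketch flags pods whose cumulative restart count grew by at least a threshold between two samples. The input shape (pod name mapped to restart count), the function name, and the default threshold of 3 are assumptions for illustration only.

```python
def flag_restarting_pods(start_counts: dict[str, int],
                         end_counts: dict[str, int],
                         threshold: int = 3) -> list[str]:
    """Return pods whose restart count rose by at least `threshold`.

    Both arguments map pod name to a cumulative restart count, sampled at
    the start and end of the window being inspected.
    """
    flagged = []
    for pod, end in end_counts.items():
        # Pods that did not exist at the start of the window count from zero.
        start = start_counts.get(pod, 0)
        # If the count went down, assume it reset (for example, the pod was
        # recreated) and use the new value as the increase.
        increase = end - start if end >= start else end
        if increase >= threshold:
            flagged.append(pod)
    return sorted(flagged)
```

Looking at the increase over a window, rather than the raw count, keeps long-lived pods with a few old restarts from being flagged as crash loops.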

Prebuilt Kubernetes Dashboard

Prebuilt Kubernetes Dashboard JSON

GitHub Metrics Dashboard

Repositories are the source of truth for your codebase, and monitoring their activity helps maintain healthy development workflows. Issues like stalled merges, inactive repositories, or sudden spikes in changes can slow down productivity even if builds aren’t failing.

A GitHub dashboard helps you visualize repository-level activity, track contribution patterns, and detect bottlenecks early. It provides visibility into merges, references, and overall repository health so you can ensure smooth collaboration across teams.

You should focus on metrics that reveal repository activity and efficiency: repository counts, change frequency, merge times, and reference activity. These metrics help answer questions such as:

  • Which repositories are experiencing the most changes?
  • How long do merges typically take, and are they increasing?
  • Are certain repositories carrying more activity or complexity than others?
  • Are references and line deltas stable, or are they showing churn?

Recommended GitHub dashboard panels

  1. Repository Count (stat panel): Monitor the number of repositories tracked over time.
  2. Top 10 Repos by Changes (bar chart): Identify repositories with the most activity.
  3. Reference Count by Repo (bar chart): Track branch and tag references per repository.
  4. Repository List (table): View repository names, organizations, and links for quick navigation.
  5. Merge Over Time (area chart): Measure how long merges take across repositories.
  6. Reference Lines Delta (area chart): Spot churn in code changes across repositories.

By combining these panels, you can monitor GitHub repository health in real time, detect potential process or collaboration issues, and make informed decisions to improve your team’s development workflow.
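
Two of the questions above, which repositories change the most and whether merges are getting slower, reduce to simple calculations. The Python sketch below shows one way to compute them. The input shapes and function names are hypothetical, and comparing medians between two periods is just one reasonable way to judge the trend.

```python
from statistics import median


def top_repos_by_changes(changes_by_repo: dict[str, int],
                         limit: int = 10) -> list[tuple[str, int]]:
    """Rank repositories by change count, highest first.

    Ties are broken alphabetically so the ordering is stable.
    """
    ranked = sorted(changes_by_repo.items(),
                    key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


def merge_time_is_increasing(previous_hours: list[float],
                             recent_hours: list[float]) -> bool:
    """Compare the median merge time of two periods.

    Returns False when either period has no merges to compare.
    """
    if not previous_hours or not recent_hours:
        return False
    return median(recent_hours) > median(previous_hours)
```

The median is used instead of the mean so that a single long-running merge does not dominate the comparison.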

Prebuilt GitHub Repo Dashboard 1 of 2 - Repo List

Prebuilt GitHub Repo Dashboard 2 of 2 - Activity Details

Prebuilt GitHub Dashboard JSON

Argo CD Metrics Dashboard

Continuous delivery pipelines rely on Argo CD to ensure that the desired state of your Kubernetes clusters matches what’s declared in Git. When deployments drift or applications degrade, it can directly impact availability.

An Argo CD dashboard gives immediate insight into application health, synchronization status, and overall stability. It shows which applications are healthy, degraded, or out of sync, and provides a detailed view of each app.

Focus on metrics that reveal deployment and cluster health: application sync status, health checks, and problem detection. These metrics help answer questions such as:

  • Which applications are healthy, degraded, or out of sync?
  • Which applications require intervention due to repeated failures?
  • How many applications are currently synced versus out of sync?

Recommended Argo CD dashboard panels

  1. Application Health Overview (stat panels): Quickly see the number of healthy, degraded, or out-of-sync applications (Total Apps, Healthy Apps, Unhealthy Apps, Degraded Apps, Out of Sync Apps, Synced Apps).

  2. App Status (table): Track the sync and health status per application including namespace, name, health, and sync state.

  3. Redis Operations (bar chart): Monitor Argo CD Redis request activity by initiator to spot operational bottlenecks.

  4. K8s API Operations (line chart): Track Kubernetes API request rates by resource kind to identify spikes or unusual activity.
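
The stat panels in the Application Health Overview are counts over per-application status. Here is a minimal Python sketch of that aggregation. It assumes each application is represented as a dictionary with `health` and `sync` keys holding Argo CD status strings such as `Healthy`, `Degraded`, `Synced`, and `OutOfSync`. Treating every application that is not `Healthy` as unhealthy is an assumption of this sketch, not necessarily how the prebuilt dashboard defines it.

```python
from collections import Counter


def summarize_apps(apps: list[dict[str, str]]) -> dict[str, int]:
    """Count applications by health and sync status."""
    health = Counter(app["health"] for app in apps)
    sync = Counter(app["sync"] for app in apps)
    return {
        "total": len(apps),
        "healthy": health["Healthy"],
        "degraded": health["Degraded"],
        # Assumption: any status other than Healthy counts as unhealthy.
        "unhealthy": len(apps) - health["Healthy"],
        "synced": sync["Synced"],
        "out_of_sync": sync["OutOfSync"],
    }
```

A `Counter` returns zero for statuses that never appear, so the summary stays well-formed even when no application is degraded or out of sync.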

Prebuilt Argo CD Dashboard

Prebuilt Argo CD Dashboard JSON

Prometheus Metrics Dashboard

Prometheus powers your monitoring stack, but even Prometheus itself needs monitoring. If the Prometheus server is down, scrape targets fail, or alert rules are misconfigured, you won’t see critical metrics, leaving your observability stack blind.

The OpenObserve Prometheus dashboard lets you monitor the Prometheus server directly. It provides real-time visibility into server health, scrape performance, rule evaluations, and storage usage, ensuring that your monitoring pipeline is reliable.

Focus on metrics that reveal the Prometheus server’s health and reliability. These metrics help answer questions such as:

  • Are all scrape targets up and responding?
  • Are scrapes completing on time, or are there slow or failing endpoints?
  • Are alerting rules executing successfully, or are there errors?
  • How much storage is Prometheus using, and is retention configured correctly?

Recommended Prometheus dashboard panels

  1. Target Availability (stat panel or bar chart): Quickly see which scrape targets are up or down.
  2. Scrape Duration (line chart): Detect slow or failing scrapes that could delay metrics collection.
  3. Rule Evaluation Success/Failure (bar chart): Track the execution of alerting rules to catch errors early.
  4. Alerts Firing (table or line chart): Monitor active alerts in real time to respond promptly.
  5. Storage Usage (line chart or gauge): Keep track of Prometheus disk usage and retention trends.
  6. Top Failing Targets (table): Identify frequently failing scrape targets for immediate action.
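
The Target Availability panel can be derived from Prometheus's `up` metric, which is 1 when a target's last scrape succeeded and 0 when it failed. The Python sketch below assumes you already have the latest `up` value per target in a dictionary. The function name and input shape are hypothetical.

```python
def target_availability(up_by_target: dict[str, int]) -> tuple[float, list[str]]:
    """Return (percent of targets up, sorted list of targets that are down).

    `up_by_target` maps a target label to its latest `up` value (1 or 0).
    """
    if not up_by_target:
        # No targets configured: report 0% rather than dividing by zero.
        return 0.0, []
    down = sorted(target for target, value in up_by_target.items()
                  if value == 0)
    up_count = len(up_by_target) - len(down)
    return 100.0 * up_count / len(up_by_target), down
```

Returning the list of down targets alongside the percentage gives the stat panel and the failing-targets table the same source of truth.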

Prebuilt Prometheus Dashboard JSON

Host Metrics Dashboard

Even with containers, Kubernetes, and CI/CD pipelines, the foundation of your infrastructure is the host itself. CPU spikes, memory pressure, disk saturation, or network congestion at the host level can cause cascading failures across all your services. Monitoring hosts ensures you catch these issues before they affect applications.

The Host Metrics dashboard provides visibility into the health and resource usage of your servers or VMs. It helps you answer questions such as:

  • Are any hosts running out of CPU, memory, or disk space?
  • Are there unusual spikes in network traffic that could affect service performance?
  • Which hosts are under the most load, and are resources balanced across the cluster?
  • Are there recurring patterns that indicate potential hardware or configuration issues?

Recommended Host Metrics dashboard panels

  1. CPU Usage per Host (line chart): Detect hosts under CPU pressure before they throttle containers or processes.
  2. Memory Usage per Host (stacked area chart): Identify hosts nearing memory limits or showing memory leaks.
  3. Disk Usage and I/O (line chart for read/write): Spot storage bottlenecks or failing disks early.
  4. Network Traffic per Host (line chart for in/out): Detect network saturation or unusual traffic spikes.
  5. Top Resource-Consuming Hosts (table): Quickly see which servers are using the most resources.
  6. Host Health Overview (stat panel or pie chart): Snapshot of healthy vs overloaded hosts.

By monitoring hosts alongside containers, applications, and Prometheus, you gain a full-stack view of your environment. This ensures that resource issues at the infrastructure level don’t go unnoticed and provides early warning before they impact users.
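
The Host Health Overview panel needs a rule for what counts as overloaded. The Python sketch below shows one simple threshold-based rule. The `HostSample` type, its fields, and the default limits (90% CPU, 90% memory, 85% disk) are illustrative assumptions that you would tune for your own environment.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """Latest resource usage for one host (hypothetical shape)."""
    name: str
    cpu_percent: float
    memory_percent: float
    disk_percent: float


def overloaded_hosts(samples: list[HostSample],
                     cpu_limit: float = 90.0,
                     memory_limit: float = 90.0,
                     disk_limit: float = 85.0) -> list[str]:
    """Names of hosts at or above any limit, in input order."""
    return [
        sample.name
        for sample in samples
        if sample.cpu_percent >= cpu_limit
        or sample.memory_percent >= memory_limit
        or sample.disk_percent >= disk_limit
    ]
```

The same thresholds can double as starting points for alert conditions once you have confirmed they match normal load on your hosts.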

Prebuilt Host Metrics Dashboard

Prebuilt Host Metrics Dashboard JSON

Conclusion

Dashboards are the central tool for SREs, DevOps engineers, and developers to maintain system health, detect issues early, and make data-driven decisions. By combining prebuilt dashboards for Docker, Kubernetes, Jenkins, GitHub, Argo CD, Prometheus, and host metrics, you get full-stack visibility across applications, infrastructure, and CI/CD pipelines.

Next Steps

  1. Import the prebuilt dashboards.
  2. Adjust panels to match your environment, thresholds, and team priorities.
  3. Set up alerts based on critical metrics to proactively respond to incidents.

Ready to put this into practice? Sign up for an OpenObserve cloud account (14-day free trial) or visit our downloads page to self-host OpenObserve.

About the Author

Simran Kumari

Passionate about observability, AI systems, and cloud-native tools. All in on DevOps and improving the developer experience.
