Monitoring NVIDIA GPUs with OpenObserve

The AI-driven infrastructure landscape is evolving, and GPU clusters represent one of the most significant capital investments for organizations. Whether you're running large language models, training deep learning models, or processing massive datasets, your NVIDIA GPUs (H100s, H200s, A100s, or L40S) are the workhorses powering your most critical workloads.

But here's the challenge: how do you know if your GPU infrastructure is performing optimally?

Traditional monitoring approaches fall short when it comes to GPU infrastructure. System metrics like CPU and memory utilization don't tell you if your GPUs are thermal throttling, experiencing memory bottlenecks, or operating at peak efficiency. You need deep visibility into GPU-specific metrics like utilization, temperature, power consumption, memory usage, and PCIe throughput.

This is where NVIDIA's Data Center GPU Manager (DCGM) Exporter combined with OpenObserve creates a powerful, cost-effective monitoring solution that gives you real-time insights into your GPU infrastructure.

Why GPU Monitoring Matters

The High Cost of GPU Inefficiency

Consider this scenario: You're running an 8x NVIDIA H200 cluster. Each H200 costs approximately $30,000-$40,000, meaning your hardware investment alone is around $240,000-$320,000. Operating costs (power, cooling, infrastructure) can easily add another $50,000-$100,000 annually.

Now imagine:

  • Thermal throttling reducing performance by 15% due to poor cooling
  • GPU memory leaks causing jobs to fail silently
  • Underutilization with GPUs sitting idle 40% of the time
  • Hardware failures going undetected until complete outage
  • PCIe bottlenecks limiting data transfer rates

Without proper monitoring, you're flying blind. You might be:

  • Wasting $50,000+ annually on inefficient GPU utilization
  • Missing critical performance degradation before it impacts production
  • Unable to justify ROI on GPU infrastructure to stakeholders
  • Lacking data for capacity planning and optimization decisions

What You Need to Monitor

Effective GPU monitoring requires tracking dozens of metrics across multiple dimensions (a quick way to list what your exporter actually exposes follows these lists):

Performance Metrics:

  • GPU compute utilization (%)
  • Memory bandwidth utilization (%)
  • Tensor Core utilization
  • SM (Streaming Multiprocessor) occupancy

Thermal & Power:

  • GPU temperature (°C)
  • Power consumption (W)
  • Power limit throttling events
  • Thermal throttling events

Memory:

  • GPU memory usage (MB/GB)
  • Memory allocation failures
  • ECC (Error Correction Code) errors
  • Memory clock speeds

Interconnect:

  • PCIe throughput (TX/RX)
  • NVLink bandwidth
  • NVSwitch fabric health
  • Data transfer bottlenecks

Health & Reliability:

  • XID errors (hardware faults)
  • Page retirement events
  • GPU compute capability
  • Driver version compliance
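
Once DCGM Exporter is up and running (Step 2 below), you can quickly see which of these metric families your environment actually exposes. A minimal check, assuming the exporter's default port of 9400:

# List the unique DCGM metric names currently exported (default port 9400)
curl -s http://localhost:9400/metrics | grep -oE '^DCGM_[A-Z0-9_]+' | sort -u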

The Solution: DCGM Exporter + OpenObserve

What is DCGM Exporter?

NVIDIA's Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs. DCGM Exporter exposes GPU metrics in Prometheus format, making it easy to integrate with modern observability platforms.

You can find more details about DCGM Exporter in NVIDIA's official documentation.

Key capabilities:

  • Exposes 40+ GPU metrics per device
  • Supports all modern NVIDIA datacenter GPUs (A100, H100, H200, L40S)
  • Low overhead monitoring (~1% GPU utilization)
  • Works with Docker, Kubernetes, and bare metal
  • Handles multi-GPU and multi-node deployments
  • Provides health diagnostics and error detection

Complete Setup Guide

Prerequisites

Before starting, ensure you have the following (a few quick verification commands follow this list):

  • GPU-enabled server (cloud or on-premises)
  • NVIDIA GPUs installed and recognized by the system
  • NVIDIA drivers version 535+ (550+ recommended for H200)
  • Docker installed and configured with NVIDIA Container Toolkit
  • OpenObserve instance (cloud or self-hosted)
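
The commands below are one way to sanity-check these prerequisites; nvidia-ctk is the NVIDIA Container Toolkit CLI and may not be present if the toolkit was installed differently:

# Verify driver version (535+, 550+ recommended for H200) and GPU visibility
nvidia-smi --query-gpu=name,driver_version --format=csv

# Verify Docker and the NVIDIA Container Toolkit
docker --version
nvidia-ctk --version

# Confirm containers can see the GPUs
docker run --rm --gpus all ubuntu nvidia-smi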

Step 1: Verify GPU Detection

First, confirm your GPUs are properly detected by the system:

# Check if GPUs are visible
nvidia-smi

# Expected output: List of GPUs with utilization, temperature, and memory

For NVIDIA H200 or multi-GPU systems with NVSwitch, you'll need the NVIDIA Fabric Manager:

# Install fabric manager (version should match your driver)
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-fabricmanager-535

# Reboot to load new driver
sudo reboot

# After reboot, start the service
sudo systemctl start nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager

# Verify
nvidia-smi  # Should now show all GPUs
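
Before moving on, it can be worth confirming that the fabric manager is actually running and that the NVLink/NVSwitch topology looks sane. Two optional checks:

# Fabric manager should report "active" on NVSwitch systems
sudo systemctl is-active nvidia-fabricmanager

# Print the GPU interconnect (NVLink/PCIe) topology matrix
nvidia-smi topo -m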

Step 2: Deploy DCGM Exporter

Deploy DCGM Exporter as a Docker container. This lightweight container exposes GPU metrics on port 9400:

docker run -d \
  --gpus all \
  --cap-add SYS_ADMIN \
  --network host \
  --name dcgm-exporter \
  --restart unless-stopped \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04

Configuration breakdown:

  • --gpus all - Grants access to all GPUs on the host
  • --cap-add SYS_ADMIN - Required for DCGM to query GPU metrics
  • --network host - Uses host networking for easier access
  • --restart unless-stopped - Ensures resilience across reboots

Verify DCGM is working:

# Wait 10 seconds for initialization
sleep 10

# Access metrics from inside the container
docker exec dcgm-exporter curl -s http://localhost:9400/metrics | head -30

# You should see output like:
# DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxxx",...} 45.0
# DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx",...} 42.0
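
Because the container uses host networking, the same endpoint is also reachable directly from the host. A quick spot check of a few key gauges:

# Spot-check utilization, temperature, and power draw from the host (default port 9400)
curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP|POWER_USAGE)'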

Step 3: Configure OpenTelemetry Collector

The OpenTelemetry Collector scrapes metrics from DCGM Exporter and forwards them to OpenObserve. Create the configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'dcgm-gpu-metrics'
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:9400']
          metric_relabel_configs:
            # Keep only DCGM metrics
            - source_labels: [__name__]
              regex: 'DCGM_.*'
              action: keep

exporters:
  otlphttp/openobserve:
    endpoint: https://example.openobserve.ai/api/ORG_NAME/
    headers:
      Authorization: "Basic YOUR_O2_TOKEN"

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp/openobserve]

Get your OpenObserve credentials:

For ingestion token authentication (recommended), go to the OpenObserve UI → Datasources → Custom → Otel Collector and copy the ingestion token shown there.

Update the Authorization header in the config with your base64-encoded credentials.
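
If you are using basic authentication rather than the token copied from the UI, the header value is simply the base64 encoding of "email:password". A minimal example with placeholder credentials (substitute your own), and remember to save the YAML above as otel-collector-config.yaml so it matches the volume mount in the next step:

# Generate the Basic auth value for the Authorization header (placeholder credentials)
echo -n "you@example.com:your_password" | base64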

Step 4: Deploy OpenTelemetry Collector

docker run -d \
  --network host \
  -v $(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
  --name otel-collector \
  --restart unless-stopped \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml

Check OpenTelemetry Collector:

# View collector logs
docker logs otel-collector

# Look for successful scrapes (no error messages)
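
If the log output is noisy, a filter like the one below surfaces only warnings and errors; empty output generally means a clean startup:

# Show only recent warning/error lines from the collector logs
docker logs otel-collector 2>&1 | grep -iE 'error|warn' | tail -n 20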

Check OpenObserve:

  1. Log into OpenObserve UI
  2. Navigate to Metrics section
  3. Search for metrics starting with DCGM_
  4. Data should appear within 1-2 minutes

(Screenshot: DCGM metrics listed in OpenObserve)

Step 5: Generate GPU Load (Optional)

To verify monitoring is working, generate some GPU activity:

# Install PyTorch
pip3 install torch

# Create a load test script
cat > gpu_load.py <<'EOF'
import torch
import time

print("Starting GPU load test...")
devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
tensors = [torch.randn(15000, 15000, device=d) for d in devices]

print(f"Loaded {len(devices)} GPUs")
while True:
    for tensor in tensors:
        _ = torch.mm(tensor, tensor)
    time.sleep(0.5)
EOF

# Run load test
python3 gpu_load.py

Watch your metrics in OpenObserve - you should see GPU utilization spike!
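
While the script runs, you can also watch utilization climb straight from the exporter as a local sanity check (Ctrl+C to stop); OpenObserve will show the same spike within a scrape interval or two:

# Poll DCGM Exporter every 5 seconds and print the per-GPU utilization gauges
watch -n 5 "curl -s http://localhost:9400/metrics | grep '^DCGM_FI_DEV_GPU_UTIL'"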

Creating Dashboards in OpenObserve

  1. Download the dashboard JSON files from our community repository.
  2. In the OpenObserve UI, go to Dashboards → Import → Drop your files here → select your JSON → Import.
  3. Once the dashboard has been imported, you will see the prebuilt GPU panels; you can always customize the dashboards as needed.

Setting Up Alerts

Critical alerts to configure in OpenObserve:

1. High GPU Temperature

DCGM_FI_DEV_GPU_TEMP > 85

Severity: Warning at 85°C, Critical at 90°C
Action: Check cooling systems, reduce workload

2. GPU Memory Near Capacity

(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.90

Severity: Warning at 90%, Critical at 95%
Action: Optimize memory usage or scale horizontally

3. Low GPU Utilization (Waste Detection)

avg(DCGM_FI_DEV_GPU_UTIL) < 20

Duration: For 30 minutes
Action: Review workload scheduling, consider rightsizing

4. GPU Hardware Errors

increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0

Severity: Critical
Action: Immediate investigation, potential RMA

5. Thermal Throttling Detected

increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0

Severity: Warning
Action: Improve cooling or reduce ambient temperature

6. GPU Offline

absent(DCGM_FI_DEV_GPU_TEMP)

Duration: For 2 minutes
Action: Check GPU health, driver status, fabric manager
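
Before creating these alerts, it helps to confirm that the counters they reference are actually being exported in your environment, since the available fields can vary with the dcgm-exporter version and its collectors configuration:

# List which of the alert-related DCGM counters your exporter currently exposes
curl -s http://localhost:9400/metrics \
  | grep -oE '^DCGM_FI_DEV_(GPU_TEMP|FB_USED|FB_FREE|GPU_UTIL|XID_ERRORS|THERMAL_VIOLATION)' \
  | sort -u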

Traditional Monitoring vs. GPU Monitoring with OpenObserve

| Aspect | Traditional Monitoring (Prometheus/Grafana) | OpenObserve for GPU Monitoring |
|---|---|---|
| Setup Complexity | Requires Prometheus, node exporters, Grafana, storage backend, and complex configuration | Single unified platform with built-in visualization |
| Storage Costs | High - Prometheus stores all metrics at full resolution, requires expensive SSD storage | 80% lower - advanced compression and columnar storage |
| Multi-tenancy | Complex setup requiring multiple Prometheus instances or federation | Built-in with organization isolation and access controls |
| Alerting | Separate alerting system (Alertmanager), complex routing configuration | Integrated alerting with flexible notification channels |
| Long-term Retention | Expensive - requires additional tools like Thanos or Cortex | Native long-term storage with automatic data lifecycle management |
| GPU-Specific Features | Generic time-series database, not optimized for GPU metrics | Optimized for high-cardinality workloads like GPU monitoring |
| Log Correlation | Separate log management system needed (ELK, Loki) | Unified logs, metrics, and traces in one platform |
| Setup Time | 4-8 hours (multiple components, configurations, troubleshooting) | 30 minutes end-to-end |
| Maintenance Overhead | High - multiple systems to update, monitor, and troubleshoot | Low - single platform with automatic updates |

ROI Examples

For an 8-GPU H200 cluster worth $320,000:

Detect thermal throttling early:

  • 15% performance loss = $48,000 annual waste
  • Early detection saves this loss
  • ROI: 990% in first year

Optimize utilization:

  • Increase from 40% to 70% = 75% more work
  • Defer $240,000 expansion by 1 year
  • ROI: 4,900% in first year

Prevent downtime:

  • 1 hour downtime = $2,800 revenue loss
  • Preventing 5 hours/year = $14,000 saved
  • ROI: 289% in first year

Conclusion

GPU monitoring is no longer optional; it's essential infrastructure for any organization running GPU workloads. The combination of DCGM Exporter and OpenObserve provides:

  • Complete visibility into GPU health, performance, and utilization
  • Cost optimization through identifying waste and inefficiencies
  • Proactive alerting to prevent outages and degradation
  • Data-driven decisions for capacity planning and architecture
  • 89% lower TCO compared to traditional monitoring stacks
  • 30-minute setup vs. days with traditional tools

Whether you're running AI/ML workloads, rendering farms, scientific computing, or GPU-accelerated databases, this monitoring solution delivers immediate ROI while scaling effortlessly as your infrastructure grows.

Get Started with OpenObserve Today!

Sign up for a 14-day trial, or check out our GitHub repository for self-hosting and contribution opportunities.

About the Author

Chaitanya Sistla

Chaitanya Sistla is a Principal Solutions Architect with 17X certifications across Cloud, Data, DevOps, and Cybersecurity. Leveraging extensive startup experience and a focus on MLOps, Chaitanya excels at designing scalable, innovative solutions that drive operational excellence and business transformation.
