
What You Need to Know About Prometheus Metrics: Architecture, Collection, and Optimization for Scalable Observability

November 12, 2024 by Chaitanya Sistla

Monitoring and observability have become critical aspects of modern DevOps and SRE practices. Prometheus, one of the most popular open-source monitoring solutions, has proven invaluable in enabling real-time monitoring, alerting, and data visualization. In this guide, we’ll explore the full workflow of Prometheus metrics, from setting up Prometheus to ingesting data, processing it, and visualizing it. By the end of this guide, you’ll have a clear understanding of how to integrate and leverage Prometheus metrics for observability in your system.

1. Prometheus Architecture

To understand Prometheus fully, it’s essential to explore its architecture and how each component contributes to its functionality.

1.1 Core Components

  • Prometheus Server: The main server that scrapes and stores metrics, processes rules, and runs queries.
  • Exporters: Small programs that expose metrics from external sources, like the OS or databases, in a Prometheus-readable format.
  • Alertmanager: Handles alerts generated by Prometheus rules and routes them to various receivers, such as email or Slack.
  • Service Discovery: Helps Prometheus locate and add monitoring targets dynamically in environments with dynamic infrastructure, like Kubernetes.

1.2 Data Flow Overview

Prometheus collects metrics by scraping endpoints at configured intervals. The data is then stored in a time-series database, which supports a variety of operations, including aggregations and mathematical computations. Prometheus uses PromQL to query stored data and analyze system performance.
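
For example, once metrics are stored, a PromQL expression can turn raw counters into rates and aggregate them. The query below is a minimal sketch (it assumes Node Exporter metrics, introduced in the next section, are already being scraped): it computes per-instance CPU usage as the average rate of non-idle CPU seconds over the last five minutes.

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))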

2. Setting Up Metrics Collection and Exporters

In Prometheus, exporters are used to gather metrics from various sources, such as system hardware, applications, and databases, exposing them in a format that Prometheus can read and scrape.

2.1 Installing Node Exporter for System Metrics

Node Exporter is commonly used to gather system-level metrics like CPU, memory, and disk usage. Here’s how to set it up:

  1. Download and Install Node Exporter:

    To find the available releases, go to https://github.com/prometheus/node_exporter/releases
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
tar -xvf node_exporter-1.2.2.linux-amd64.tar.gz
cd node_exporter-1.2.2.linux-amd64
  2. Run Node Exporter:

./node_exporter

By default, Node Exporter exposes metrics at localhost:9100/metrics.
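
You can confirm the exporter is responding by requesting the endpoint yourself; the exact metrics returned will vary by host:

# Fetch the first few exposed metrics to verify the endpoint is serving data
curl -s http://localhost:9100/metrics | head -n 20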


  3. Configure Prometheus to Scrape Node Exporter:

    In the same directory, create a prometheus.yml file that adds Node Exporter as a scrape target:
global:
  scrape_interval: 15s  # Set the default scrape interval

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

This configuration instructs Prometheus to scrape metrics from Node Exporter at localhost:9100.

2.2 Custom Application Metrics

To monitor specific aspects of an application’s performance, you can instrument custom metrics within your application. Below is an example using Python.

  1. Install the Prometheus Client Library:

pip install prometheus_client
  2. Create a Simple Python Script to Expose Metrics:

    This script exposes two types of metrics: a counter for the number of requests and a gauge for request latency (a Histogram-based alternative is sketched after these steps).
from prometheus_client import start_http_server, Counter, Gauge
import time
import random

REQUEST_COUNTER = Counter('app_request_count', 'Number of requests received')
REQUEST_LATENCY = Gauge('app_request_latency_seconds', 'Latency of requests in seconds')

def process_request():
    REQUEST_COUNTER.inc()
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.1, 1.0))

if __name__ == '__main__':
    start_http_server(8000)  # Start a Prometheus metrics endpoint
    while True:
        process_request()
  3. Run the Script:

python your_script.py
  4. Add the Application as a Scrape Target in Prometheus:

    Update prometheus.yml to include your application as a scrape target:
scrape_configs:
  - job_name: 'my_python_app'
    static_configs:
      - targets: ['localhost:8000']
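
For request latency specifically, a Histogram is often preferred over a Gauge because it records a distribution (bucket counts, total count, and sum) rather than only the most recent value. Here is a minimal sketch of that variant, replacing the Gauge in the script above (the request counter is unchanged and omitted here):

from prometheus_client import start_http_server, Histogram
import time
import random

# Histogram records observations into buckets plus _count and _sum series
REQUEST_LATENCY = Histogram('app_request_latency_seconds', 'Latency of requests in seconds')

def process_request():
    # .time() observes how long the wrapped block takes
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.1, 1.0))

if __name__ == '__main__':
    start_http_server(8000)  # Start a Prometheus metrics endpoint
    while True:
        process_request()

With a Histogram you can later compute percentiles in PromQL, for example histogram_quantile(0.95, rate(app_request_latency_seconds_bucket[5m])).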

3. Ingesting Prometheus Metrics into OpenObserve

Why Ingest Prometheus Metrics into OpenObserve?

Prometheus is perfect for quick, real-time metrics, but as systems grow, storing, scaling, and analyzing long-term data becomes more challenging. OpenObserve steps in as a powerful companion, allowing you to keep Prometheus metrics over a longer period and scale effortlessly without complex setups or storage limitations. By sending metrics from Prometheus to OpenObserve, you retain the flexibility of Prometheus for instant monitoring while gaining a scalable backend for deeper, historical insights and advanced analytics.

With OpenObserve, your observability stack is ready to scale alongside your infrastructure, ensuring smooth, reliable performance as your systems grow.


Once your system and application metrics are configured, you can set up Remote Write in Prometheus to send these metrics directly to OpenObserve for centralized visualization and long-term storage.

3.1 Configure Remote Write to OpenObserve


To send data to OpenObserve, add a remote_write section in prometheus.yml:

remote_write:
  - url: https://<openobserve_host>/api/<org_name>/prometheus/api/v1/write
    queue_config:
      max_samples_per_send: 10000
    basic_auth:
      username: <openobserve_user>
      password: <openobserve_password>
  • url: This specifies the OpenObserve endpoint where Prometheus sends metrics data.
  • queue_config: Configures the batch size (max_samples_per_send) of metrics sent to OpenObserve, helping manage throughput.
  • basic_auth: Provides secure access to OpenObserve, ensuring only authorized users send data.
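
Beyond max_samples_per_send, queue_config accepts additional tuning parameters. The values below are illustrative placeholders rather than recommendations, and should be adjusted to your ingestion volume:

remote_write:
  - url: https://<openobserve_host>/api/<org_name>/prometheus/api/v1/write
    queue_config:
      capacity: 10000              # samples buffered per shard before scrapes block
      max_shards: 50               # upper bound on parallel sending shards
      max_samples_per_send: 10000  # batch size per outgoing request
      batch_send_deadline: 5s      # flush a partial batch after this long
    basic_auth:
      username: <openobserve_user>
      password: <openobserve_password>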

3.2 Testing the Remote Write Configuration

  1. Install Prometheus so it can scrape the configured endpoints:
wget https://github.com/prometheus/prometheus/releases/download/v2.30.3/prometheus-2.30.3.linux-amd64.tar.gz
tar -xvf prometheus-2.30.3.linux-amd64.tar.gz
cd prometheus-2.30.3.linux-amd64
  2. Start Prometheus to apply the new configuration:
./prometheus --config.file=prometheus.yml
  3. Verify Metrics Ingestion in OpenObserve: Log into OpenObserve’s dashboard and confirm that it’s receiving Prometheus data. You should see metrics from Node Exporter and your custom application populating the OpenObserve dashboard.
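
If data doesn’t appear, Prometheus’s own /metrics endpoint exposes remote-write self-metrics (prefixed prometheus_remote_storage_) that show whether samples are being sent, retried, or dropped. A quick way to inspect them:

# Inspect Prometheus's remote-write self-metrics for sent/failed samples
curl -s http://localhost:9090/metrics | grep prometheus_remote_storage_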

4. Visualizing Metrics Directly in OpenObserve

With metrics ingested into OpenObserve, you can now use its visualization tools to create insightful dashboards and analyze data. Here’s a step-by-step guide for setting up and customizing these visualizations.

4.1 Setting Up a New Dashboard

  1. Create a Dashboard:

    • In OpenObserve, navigate to the Dashboards section.
    • Select Create New Dashboard and give it a meaningful name like “System and Application Metrics.”
  2. Add Panels for Key Metrics:

    • You can add various panels for specific metrics (e.g., CPU usage, memory, application request count).

4.2 Example Panels for Common Metrics

Here are a few examples of commonly used panels with corresponding queries.


Total Application Requests:

  • Metric: app_request_count
  • Visualization: Select a line or bar chart to show the count over time.

Average Request Latency:

  • Metric: app_request_latency_seconds
  • Query: An aggregation such as avg_over_time(app_request_latency_seconds[5m]) to show average latency over time.
  • Visualization: Use a gauge or time-series chart.

System CPU Usage:

  • Metric: node_cpu_seconds_total
  • Query: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  • Visualization: Use a line chart to show CPU usage trends by instance.

Memory Utilization:

  • Metric: node_memory_MemAvailable_bytes
  • Query: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes (the fraction of memory still available; values approaching 0 indicate high utilization)
  • Visualization: Display the available-memory fraction over time with a line chart.

Disk Usage:

  • Metric: node_filesystem_free_bytes
  • Query: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes
  • Description: Monitors disk usage as a percentage of the total available disk space. This is essential for tracking storage capacity and avoiding potential disk saturation.
  • Visualization: Use a line or area chart to track disk usage over time, with critical usage thresholds highlighted.

Network I/O:

  • Metric: node_network_receive_bytes_total and node_network_transmit_bytes_total
  • Query:
    • Receive: rate(node_network_receive_bytes_total[5m])
    • Transmit: rate(node_network_transmit_bytes_total[5m])
  • Description: Tracks network traffic, showing both incoming and outgoing data in bytes per second. This helps detect network bottlenecks and monitor bandwidth usage.
  • Visualization: Use a dual-axis line chart or two separate line charts to distinguish between received and transmitted data.

CPU Load Average:

  • Metric: node_load1, node_load5, node_load15
  • Query: Use node_load1, node_load5, and node_load15 directly to show 1-minute, 5-minute, and 15-minute CPU load averages.
  • Description: Provides insight into CPU load trends over different time frames, helping to identify periods of high CPU usage and assess system load.
  • Visualization: Use a line chart with multiple series for each load metric to compare short-term and long-term CPU load averages.

Memory Usage:

  • Metric: node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes
  • Query: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
  • Description: Tracks the percentage of memory currently in use. High memory usage over time can indicate the need for additional resources or optimizations.
  • Visualization: Use a line chart or gauge to show memory usage over time, with thresholds for low, moderate, and high memory usage.

System Context Switches:

  • Metric: node_context_switches_total
  • Query: rate(node_context_switches_total[5m])
  • Description: Counts the rate of context switches per second, which can help monitor CPU scheduling. A high rate of context switches may indicate heavy multitasking or performance bottlenecks.
  • Visualization: Use a line chart to track context switches over time, identifying any unusual spikes that could signify performance issues.

System Uptime:

  • Metric: node_time_seconds and node_boot_time_seconds
  • Query: node_time_seconds - node_boot_time_seconds
  • Description: Calculates the node's uptime by subtracting the boot time from the current system time. Useful for tracking system reliability and uptime compliance.
  • Visualization: Use a single-value chart showing total uptime in hours, days, or weeks, depending on the length of operation.

To make it easier to set up, here is an attached JSON file that you can import directly into your OpenObserve dashboard. This file includes pre-configured panels for each of the metrics described above, allowing you to get started with node-level monitoring quickly and efficiently.

How to Import Your Dashboard to OpenObserve:

  1. Download the JSON file to your local system.
  2. In OpenObserve, navigate to Dashboards and select Import.
  3. Upload the JSON file, and OpenObserve will automatically configure the panels and visualizations.

This will set up a complete node-level monitoring dashboard with metrics ready to go.


4.3 Configuring Alerts in OpenObserve

OpenObserve supports alerts, enabling you to set thresholds on critical metrics. For example, you can set an alert for high memory usage:

  1. Create a New Alert:

    • In OpenObserve, go to the Alerts section.
    • Configure a new alert rule based on the node_memory_MemAvailable_bytes metric to monitor available memory.
  2. Define Alert Conditions:

    • Set conditions, such as alerting when available memory stays below a specified threshold for an extended period (see the example query after these steps).
  3. Choose Notification Channels:

    • Configure a notification destination (such as the Slack or email receiver you set up earlier) to ensure you’re alerted promptly.
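
As a concrete illustration of the condition in step 2, the threshold can be expressed as a PromQL-style query on available memory; the 10% cutoff below is only an example value:

# Fire when less than 10% of total memory has been available
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10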


5. Optimizing Prometheus Metrics for Scalable, Insightful Observability


Optimizing your Prometheus setup is essential for efficient metrics management and powerful monitoring of application and infrastructure health. By implementing best practices like tuning scrape intervals, managing label cardinality, and tracking Prometheus health metrics, you ensure scalable and insightful observability. This approach helps maintain system stability and supports proactive improvements, making your Prometheus monitoring both effective and future-ready.

| Best Practice | Description | Benefit |
| --- | --- | --- |
| Optimize Scrape Intervals | Set appropriate scrape intervals to balance data granularity with system load. Adjust intervals based on the metric’s importance and frequency of change. | Reduces system load, avoids data overload, and maintains relevant metrics without excessive detail. |
| Manage Label Cardinality | Limit the number of unique label combinations (cardinality) to prevent excessive memory and CPU use. Avoid high-cardinality labels (e.g., UUIDs). | Enhances performance, reduces memory usage, and prevents excessive data ingestion costs. |
| Use Remote Write for Long-Term Storage | Configure Prometheus to send data to OpenObserve using Remote Write, ensuring that older metrics are stored efficiently outside of Prometheus’ local storage. | Extends data retention, reduces local storage pressure, and enables long-term trend analysis. |
| Implement Recording Rules | Define recording rules for frequently queried metrics, precomputing results to avoid redundant calculations at query time (see the example after this table). | Speeds up query performance, reduces load on Prometheus, and improves user experience. |
| Monitor Prometheus Health Metrics | Track Prometheus’s own health metrics (e.g., memory usage, CPU, scrape duration) to proactively manage and scale the Prometheus instance as needed. | Prevents performance bottlenecks, enables proactive troubleshooting, and ensures reliable monitoring. |
| Centralize Metrics in OpenObserve | Aggregate and visualize metrics in OpenObserve, allowing for enhanced analytics, dashboards, and alerting across Prometheus, Node Exporter, and other sources. | Provides a centralized observability platform, improving insights and simplifying management tasks. |
| Automate Alerts and Notifications | Set up alerts for key metrics and system performance thresholds to catch issues early, preventing downtime or degradation. | Enhances response time, prevents downtime, and supports proactive system management. |
| Balance Retention Policies with Data Needs | Adjust Prometheus data retention based on operational needs and data utility, ensuring only necessary data is retained. | Optimizes storage costs, maintains data relevancy, and reduces unnecessary data accumulation. |
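
To illustrate the recording-rule practice above, here is a minimal sketch. The file name, group name, and recorded metric name are arbitrary choices; the expression simply precomputes the per-instance CPU usage query used earlier so dashboards can read the cheaper, precomputed series:

# recording_rules.yml
groups:
  - name: node_recording_rules
    rules:
      - record: instance:node_cpu_usage:rate5m
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

Reference the file from prometheus.yml so it is evaluated at each rule interval:

rule_files:
  - recording_rules.yml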

Get Started with OpenObserve for Effortless Metrics Management

Ready to take your observability to the next level? OpenObserve offers a seamless platform for visualizing and storing Prometheus metrics long-term, all in one place. Start your journey with OpenObserve today to centralize your metrics, streamline data retention, and enhance your monitoring capabilities.

Get started with OpenObserve now and unlock powerful insights into your systems!

Author:


Chaitanya Sistla is a Principal Solutions Architect with 14X certifications across Cloud, Data, DevOps, and Cybersecurity. Leveraging extensive startup experience and a focus on MLOps, Chaitanya excels at designing scalable, innovative solutions that drive operational excellence and business transformation.
