Monitoring Kubernetes System Components with Prometheus Metrics

September 13, 2024 by OpenObserve Team

Prometheus, a powerful open-source monitoring tool, plays a pivotal role in monitoring Kubernetes system components. Its compatibility with Kubernetes makes it an essential component for collecting and analyzing metrics from those components.

Prometheus integrates seamlessly with Kubernetes, allowing you to monitor the health and performance of your clusters effectively. Metrics are vital for identifying potential issues, optimizing resource usage, and maintaining overall system health. This section will explore how Prometheus fits into the Kubernetes ecosystem and the benefits it brings to your monitoring strategy.

By understanding how Prometheus works with Kubernetes, you'll be equipped to implement a robust monitoring solution that ensures your applications run smoothly and efficiently.

Kubernetes System Components Monitoring

To effectively monitor your Kubernetes environment, it’s essential to understand the critical components that require close observation. These components include nodes, pods, and the control plane, all of which are crucial for the seamless operation of your clusters.

Critical Components of Kubernetes

  1. Nodes: The worker machines in Kubernetes that run containerized applications. Monitoring node performance and resource usage is essential for maintaining cluster health.
  2. Pods: The smallest deployable units in Kubernetes, consisting of one or more containers. Monitoring pod status and resource consumption helps in ensuring application reliability.
  3. Control Plane: Manages the Kubernetes cluster and includes components such as kube-apiserver, kube-scheduler, and kube-controller-manager. Ensuring these components function correctly is vital for the stability of the entire cluster.

Metrics Exposed by Kubernetes System Components

Kubernetes exposes a variety of metrics that provide insights into the performance and health of these components. These metrics include:

  • Node Metrics: CPU, memory usage, disk I/O, and network activity.
  • Pod Metrics: CPU and memory usage, pod restarts, and lifecycle events.
  • Control Plane Metrics: Request latencies, error rates, and resource usage of control plane components.

How Prometheus Collects Metrics from Kubernetes

Prometheus collects metrics from Kubernetes system components through a process called scraping. It periodically scrapes HTTP endpoints that expose metrics in a format Prometheus can understand. Kubernetes components expose these metrics at specific endpoints, such as /metrics for the kube-apiserver or /metrics/cadvisor for node metrics.

Prometheus’s ability to scrape these endpoints and store the metrics in a time-series database allows for powerful querying and analysis. This setup ensures you have detailed insights into the performance and health of your Kubernetes clusters.

By monitoring these key components and metrics, you can proactively manage your Kubernetes environment, quickly identify issues, and maintain optimal performance and reliability.

Setting Up Prometheus for Kubernetes Monitoring

Deploying Prometheus within a Kubernetes cluster is a straightforward process that significantly enhances your ability to monitor and manage your cluster's performance. Here's a step-by-step guide to setting up Prometheus for Kubernetes monitoring.

Deploying Prometheus within a Kubernetes Cluster

Using Helm

Helm is a popular package manager for Kubernetes that simplifies the deployment process. You can use the Prometheus Helm chart to deploy Prometheus quickly.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus
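
The chart's defaults can be overridden with a values file. The following is a minimal sketch, assuming the chart exposes server.retention and server.persistentVolume settings; the file name values.yaml and the retention and size figures are illustrative, so confirm the keys with helm show values prometheus-community/prometheus before relying on them.

```yaml
# values.yaml (hypothetical file name)
server:
  # How long the Prometheus server keeps samples locally (illustrative value)
  retention: "15d"
  persistentVolume:
    # Keep metrics across pod restarts by backing the server with a volume
    enabled: true
    size: 20Gi   # illustrative size; adjust to your ingestion rate
```

Such a file would be passed at install time with helm install prometheus prometheus-community/prometheus -f values.yaml.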

Prometheus Operator

The Prometheus Operator simplifies Prometheus setup and management on Kubernetes. It automates the configuration and deployment of Prometheus instances.

kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml

Configuring Prometheus to Discover and Scrape Kubernetes Metrics Endpoints

Prometheus needs to be configured to discover and scrape metrics from various Kubernetes components. This involves setting up service discovery and scrape configurations.

Service Discovery

Configure Prometheus to automatically discover Kubernetes services and endpoints.

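scrape_configs:
  - job_name: 'kubernetes-apiservers'
    # The API server is served over HTTPS and requires the pod's service account credentials
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only the default/kubernetes service's https port
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https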

Scrape Configurations

Define the intervals and specific endpoints Prometheus should scrape.

scrape_configs:
  - job_name: 'kubelets'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
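
Service discovery can also target application pods. The sketch below assumes the widely used (but not built-in) convention of annotating pods with prometheus.io/scrape: "true"; the job name is illustrative.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'   # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in through the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and pod name onto every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Pods without the annotation are dropped before scraping, so new workloads are picked up without editing the Prometheus configuration.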

Using Prometheus Operator to Streamline Prometheus Deployments on Kubernetes

The Prometheus Operator simplifies managing Prometheus deployments by automating tasks like configuration and updates. It integrates seamlessly with Kubernetes, allowing you to define and manage Prometheus instances using Kubernetes manifests.

Deploying Prometheus Operator

The Prometheus Operator can be installed using Helm or by directly applying Kubernetes manifests.

kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml

Creating Prometheus Instances

Define a Prometheus instance as a custom resource of kind Prometheus, which the Operator's Custom Resource Definition (CRD) makes available.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceAccountName: prometheus-k8s
  serviceMonitorSelector:
    matchLabels:
      team: frontend
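
The serviceMonitorSelector above only picks up ServiceMonitor objects labeled team: frontend. A minimal sketch of such an object follows; the name example-app, the app label, and the port name web are hypothetical and must match your own Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app          # hypothetical name
  namespace: monitoring
  labels:
    team: frontend           # matched by the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app       # hypothetical label on the Service to scrape
  endpoints:
    - port: web              # name of the Service port that serves /metrics
      interval: 30s
```

With this in place the Operator generates the scrape configuration for you, so no manual edits to the Prometheus configuration file are needed.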

By following these steps, you can set up Prometheus to monitor your Kubernetes cluster effectively. This setup ensures you have comprehensive visibility into your cluster's performance and health, enabling proactive management and quick issue resolution.

Key Metrics for Kubernetes Components

Identifying and understanding crucial metrics for the kube-apiserver, kube-scheduler, kube-controller-manager, and kubelet:

  • kube-apiserver: Metrics like apiserver_request_duration_seconds to monitor request latency and apiserver_request_total for tracking the total number of API requests.
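  • kube-scheduler: Metrics such as scheduler_binding_duration_seconds for binding latency and scheduler_e2e_scheduling_duration_seconds for end-to-end scheduling latency. Scheduler metric names have changed across Kubernetes releases (newer versions expose scheduler_scheduling_attempt_duration_seconds, for example), so check your scheduler's /metrics output for the names your version provides.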
  • kube-controller-manager: Metrics including workqueue_adds_total to count the number of times an item is added to the work queue and workqueue_queue_duration_seconds for queue duration.
  • kubelet: Key metrics like kubelet_running_pods (named kubelet_running_pod_count in older Kubernetes releases) to monitor running pods and container_cpu_usage_seconds_total, served from the kubelet's cAdvisor endpoint, for container CPU usage.
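
Latency metrics such as apiserver_request_duration_seconds are histograms, so they are usually read as quantiles. As a sketch, a recording rule could precompute the 99th percentile per verb; the group name, rule name, and 5-minute window are illustrative choices.

```yaml
groups:
  - name: apiserver_latency          # illustrative group name
    rules:
      # Precompute p99 API request latency per verb from the histogram buckets
      - record: apiserver:request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m]))
          )
```

Dashboards and alerts can then query the recorded series instead of re-evaluating the histogram each time.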

Monitoring resource usage of Kubernetes pods with /metrics/cadvisor:

  • Use /metrics/cadvisor to track pod resource usage, including CPU, memory, and disk I/O metrics. This helps in identifying resource constraints and optimizing pod performance.
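
A scrape job for the cAdvisor endpoint might look like the following sketch. It assumes Prometheus runs inside the cluster with a service account that is allowed to read node metrics; the job name is illustrative.

```yaml
scrape_configs:
  - job_name: 'kubernetes-cadvisor'   # illustrative job name
    scheme: https
    # cAdvisor metrics are served by each kubelet under this path
    metrics_path: /metrics/cadvisor
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
```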

Tracking performance and health of the Kubernetes control plane:

  • Monitor control plane metrics to ensure the health and performance of your Kubernetes cluster. Key metrics include etcd_server_has_leader to confirm the presence of a leader and etcd_disk_wal_fsync_duration_seconds for write-ahead log sync duration.
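
These two etcd metrics translate naturally into alert rules. The sketch below assumes Prometheus already scrapes your etcd members; the thresholds and durations are illustrative starting points, not recommendations from the etcd project.

```yaml
groups:
  - name: etcd_alerts                # illustrative group name
    rules:
      # Fires when an etcd member reports that it has no leader
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
      # Fires when p99 WAL fsync latency stays above 0.5s (illustrative threshold)
      - alert: EtcdSlowWalFsync
        expr: |
          histogram_quantile(
            0.99,
            sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: warning
```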

Including OpenObserve for Enhanced Monitoring:

  • OpenObserve Integration: Enhance your monitoring setup by integrating Prometheus with OpenObserve. Utilize OpenObserve’s advanced analytics and visualization capabilities to gain deeper insights into your Kubernetes metrics.
  • Seamless Data Flow: Ensure that your Prometheus metrics flow seamlessly into OpenObserve, allowing for comprehensive analysis and real-time monitoring.

Ready to elevate your Kubernetes monitoring? Sign up for OpenObserve, explore our GitHub repository, or book a demo

Prometheus Query Language (PromQL)

PromQL is a powerful tool for querying and aggregating Kubernetes metrics. It allows you to create custom queries that provide insights into cluster health and resource utilization. Here are a few practical examples to help you get started:

Querying CPU Usage:
To get the current CPU usage (in busy cores) for each node, you can use the query below. The node exporter identifies each node through the instance label:

sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)

Memory Usage:
To monitor memory usage in bytes per node, you can use:
sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) by (instance)

Pod Status:
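To count running pods per namespace, you can use the query below. kube_pod_status_phase is 1 for a pod's current phase and 0 for every other phase, so the values are summed rather than counted:
sum(kube_pod_status_phase{phase="Running"}) by (namespace)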

Disk Space Usage:

To monitor disk space, use:
sum(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) by (instance)

These queries provide a starting point for monitoring various aspects of your Kubernetes cluster. 

Integrating OpenObserve can further enhance your monitoring capabilities, offering advanced data visualization and real-time analytics based on PromQL queries. This integration allows for a more comprehensive understanding of your cluster's performance and health.

Visualizing Kubernetes Metrics with OpenObserve

Integrating Prometheus with OpenObserve enables advanced data visualization and long-term storage of your Kubernetes metrics. 

OpenObserve enhances your monitoring capabilities by providing deep insights and robust analytics tailored to Kubernetes environments.

Advantages of OpenObserve

  1. Real-Time Analytics: OpenObserve allows you to perform real-time analytics on your Kubernetes metrics. This capability ensures you can identify and resolve issues promptly, maintaining the health and performance of your clusters.
  2. Custom Dashboards: With OpenObserve, you can create highly customizable dashboards tailored to your specific needs. These dashboards provide a comprehensive view of your Kubernetes environment, including critical metrics and system performance indicators.
  3. Scalability: Designed to scale with your infrastructure, OpenObserve can handle large volumes of data efficiently. This scalability ensures that as your Kubernetes clusters grow, your monitoring solution remains effective and performant.
  4. Unified Log Aggregation: OpenObserve not only supports metric visualization but also log aggregation. This unified approach allows you to correlate metrics with logs, providing a more holistic view of your system's state and facilitating faster troubleshooting.

Implementing OpenObserve with Prometheus

To maximize the benefits of OpenObserve, you can run Prometheus in agent mode. In this mode, Prometheus only scrapes metrics and forwards them through remote write to OpenObserve for storage and visualization; local querying, alerting, and recording rules are disabled. Here’s a step-by-step guide:

Configure Prometheus in Agent Mode:

Agent mode is enabled with a command-line flag rather than in the configuration file: --enable-feature=agent in Prometheus 2.32 and later 2.x releases, or --agent in Prometheus 3.x. This mode reduces the resource consumption of Prometheus, making it suitable for environments with high data volumes.

Ensure the remote_write section of the Prometheus configuration is set to send metrics to OpenObserve.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: node

remote_write:
  - url: 'https://<your-openobserve-instance>/api/v1/write'
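
In practice the remote write endpoint usually requires credentials, and you may not want to ship every series. The following sketch extends the remote_write entry with standard Prometheus options; the username placeholder, the password file path, and the go_.* drop rule are assumptions for illustration, and the exact URL and credentials come from your OpenObserve instance.

```yaml
remote_write:
  - url: 'https://<your-openobserve-instance>/api/v1/write'
    basic_auth:
      username: '<your-username>'
      # Hypothetical path; mount the password from a Kubernetes Secret
      password_file: /etc/prometheus/secrets/remote-write-password
    write_relabel_configs:
      # Drop Go runtime metrics before sending to cut remote storage volume
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
```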

Deploy OpenObserve:

Install and configure OpenObserve to receive and store metrics from Prometheus. Follow the OpenObserve installation guide to set up the necessary infrastructure.

Set up data retention policies and storage backends to manage long-term metric storage efficiently.

Create Dashboards in OpenObserve:

Use OpenObserve’s dashboarding capabilities to create custom views of your Kubernetes metrics. Leverage pre-built templates or design your own dashboards to monitor key performance indicators.

Set Up Alerts:

Configure alerting rules in OpenObserve to notify you of any critical issues within your Kubernetes clusters. Utilize the alerting framework to set thresholds and conditions based on the metrics collected.

Enhance your Kubernetes observability with OpenObserve. 

Sign up for our cloud service, explore our GitHub repository, or book a demo to see OpenObserve in action.

Alerting with Prometheus

Effective monitoring isn't complete without a robust alerting system. Setting up alerts ensures you can respond promptly to any issues that arise within your Kubernetes clusters. Here's how to configure alerting with Prometheus and Alertmanager.

Configuring Alertmanager to Handle Alerts Generated by Prometheus

Alertmanager is a key component in the Prometheus ecosystem, responsible for processing alerts sent by Prometheus. It manages alerts by deduplicating, grouping, and routing them to the appropriate receiver endpoints. Follow these steps to set up Alertmanager:

Install Alertmanager:

  1. Download and install Alertmanager on your Kubernetes cluster.
  2. Configure Alertmanager to receive alerts from Prometheus by modifying the alertmanager.yml configuration file.

global:
  resolve_timeout: 5m

route:
  receiver: 'default-receiver'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'

Integrate Prometheus with Alertmanager:

Modify the Prometheus configuration to include the Alertmanager endpoint.

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
rule_files:
  - 'alert_rules.yml'

Setting Up Alert Rules for Kubernetes System Components Monitoring

Define alert rules to monitor critical metrics and trigger alerts when specific conditions are met. Here are some common alert rules:

High CPU Usage:

Alert when CPU usage on any node exceeds 80%.

groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        # Busy fraction per node: 1 minus the average idle fraction across its CPUs
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for more than 5 minutes."

Pod CrashLoopBackOff:

Alert when a pod is in a CrashLoopBackOff state.

groups:
  - name: pod_alerts
    rules:
      - alert: PodCrashLoopBackOff
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod in CrashLoopBackOff state"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is in CrashLoopBackOff state."

Best Practices for Managing Alerts

To ensure your alerting system is effective and manageable, follow these best practices:

  1. Group Alerts: Group related alerts to reduce noise and make it easier to identify the root cause of issues.
  2. Deduplication: Configure Alertmanager to deduplicate alerts, preventing multiple alerts for the same issue.
  3. Routing: Set up routing rules to send alerts to the appropriate teams or individuals based on the severity and nature of the alert.
  4. Escalation Policies: Implement escalation policies to ensure critical alerts are addressed promptly if initial responders do not take action.
  5. Silencing: Use silencing rules to temporarily mute alerts during planned maintenance or known issues to avoid unnecessary distractions.
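
Grouping, routing, and noise reduction can be sketched in alertmanager.yml as follows. The receiver name oncall-pager, the grouping labels, and the timing values are illustrative assumptions; receivers are left without notification settings here and would need email, webhook, or similar configs in a real setup.

```yaml
route:
  receiver: 'default-receiver'
  # Alerts sharing these labels are batched into one notification
  group_by: ['alertname', 'namespace']
  group_wait: 30s        # wait before sending the first notification for a group
  group_interval: 5m     # wait before notifying about new alerts in the group
  repeat_interval: 4h    # wait before re-sending an unresolved notification
  routes:
    # Critical alerts go to the on-call receiver instead of the default one
    - matchers:
        - 'severity="critical"'
      receiver: 'oncall-pager'

receivers:
  - name: 'default-receiver'
  - name: 'oncall-pager'   # hypothetical receiver

inhibit_rules:
  # Suppress warnings when a critical alert with the same name and namespace is firing
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'namespace']
```

Silences for planned maintenance are created at runtime through the Alertmanager UI or API rather than in this file.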

Configuring a robust alerting system with Prometheus and Alertmanager is essential for maintaining the health and performance of your Kubernetes clusters. 

By setting up effective alert rules and following best practices, you can ensure prompt responses to critical issues and maintain the reliability of your Kubernetes environment.

Advanced Monitoring Concepts

Advanced monitoring techniques and tools are essential for maintaining and optimizing large-scale Kubernetes environments. This section delves into more sophisticated approaches, focusing on enabling kube-state-metrics, utilizing advanced monitoring strategies, and scaling Prometheus for extensive Kubernetes clusters.

Enabling and Utilizing kube-state-metrics for Enriched Kubernetes Metrics

kube-state-metrics is a crucial component for advanced Kubernetes monitoring. It provides detailed metrics about the state of Kubernetes resources, such as deployments, pods, and nodes. These metrics go beyond what is available from the default Kubernetes metrics, offering a more comprehensive view of the cluster state.

Installing kube-state-metrics:

Deploy kube-state-metrics within your Kubernetes cluster using the official Helm chart or by applying the provided manifests.

helm install kube-state-metrics prometheus-community/kube-state-metrics

Configuring kube-state-metrics:

Ensure that Prometheus is configured to scrape metrics from kube-state-metrics. Add the appropriate scrape configuration to your Prometheus configuration file.

scrape_configs:
  - job_name: 'kube-state-metrics'
    static_configs:
      - targets: ['<kube-state-metrics-service>:8080']

Utilizing Metrics:

Leverage the enriched metrics provided by kube-state-metrics to monitor the state and performance of Kubernetes resources. For example, track the status of deployments and the number of replicas available versus desired.

sum(kube_deployment_status_replicas_available) by (namespace, deployment)

Monitoring Kubernetes Cluster State with Prometheus and kube-state-metrics

Combining Prometheus with kube-state-metrics provides a powerful monitoring solution that offers deeper insights into the state of your Kubernetes clusters. Here are some key metrics and their importance:

Deployment Health:

Monitor the number of available versus desired replicas in your deployments to ensure they are operating as expected.

kube_deployment_status_replicas_available / kube_deployment_spec_replicas
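
The same comparison can drive an alert. This sketch uses kube_deployment_spec_replicas, the kube-state-metrics series for the desired replica count; the alert name and the 15-minute duration are illustrative.

```yaml
groups:
  - name: deployment_alerts          # illustrative group name
    rules:
      # Fires when a deployment has had fewer available replicas than desired for 15 minutes
      - alert: DeploymentReplicasMismatch
        expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} is missing replicas"
```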

Pod Status:

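Track the status of pods to identify any that are not running as expected. Because kube_pod_status_phase is 1 only for a pod's current phase, filter on the value instead of counting series:

sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"}) > 0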

Node Health:

Monitor node conditions and resource usage to maintain a healthy cluster.

kube_node_status_condition{condition="Ready", status="true"}
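
As a sketch, the node condition query can be turned into an alert that fires when a node stops reporting Ready; the alert name and duration are illustrative.

```yaml
groups:
  - name: node_alerts                # illustrative group name
    rules:
      # The series is 1 while the node is Ready, so 0 means the node is not Ready
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not Ready"
```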

Scaling Prometheus Monitoring for Large Kubernetes Clusters

As your Kubernetes environment grows, scaling your monitoring setup becomes critical. Here are strategies to scale Prometheus effectively:

Federation:

Use Prometheus federation to scale horizontally by aggregating metrics from multiple Prometheus instances. This approach distributes the load and ensures that no single instance becomes a bottleneck.

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
    static_configs:
      - targets:
          - 'prometheus-1:9090'
          - 'prometheus-2:9090'

Sharding:

Implement sharding to distribute metrics collection across multiple Prometheus instances. This method involves splitting the metrics load based on specific criteria, such as namespaces or node labels.
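
One common way to shard is hash-based: every instance discovers the same targets but keeps only its own slice. The sketch below shows the configuration for shard 0 of 2, assuming pod-level discovery; the job name and shard count are illustrative, and the second instance would use regex '1'.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods-shard-0'   # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 2 buckets
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_hash
        action: hashmod
      # This instance keeps bucket 0 and drops everything else
      - source_labels: [__tmp_hash]
        regex: '0'
        action: keep
```

Hash-based sharding balances load without manual assignment, at the cost of needing federation or a shared backend to query across shards.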

Remote Write:

You can use the remote write feature to send metrics to a reliable long-term storage backend like OpenObserve. This allows Prometheus to focus on efficient data collection and query processing, while OpenObserve takes care of storing and processing these metrics at scale, ensuring better performance and long-term accessibility of your data. It's an ideal solution for offloading Prometheus while maintaining seamless observability and analysis.

remote_write:
  - url: 'https://your-storage-backend/api/v1/write'

High Availability:

Ensure high availability by deploying multiple Prometheus instances in an active-passive or active-active configuration. Use tools like Thanos or Cortex to achieve a highly available, scalable, and long-term storage solution for Prometheus. 
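
When running replicas, each instance typically carries identifying external labels so that a layer such as Thanos or Cortex can deduplicate their samples. This is a sketch only; the label names and values are assumptions and must match what your deduplication layer is configured to treat as the replica label.

```yaml
global:
  external_labels:
    cluster: prod-eu-1        # hypothetical cluster name, identical on all replicas
    replica: prometheus-0     # unique per replica, for example prometheus-1 on the second instance
```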

Advanced monitoring techniques, such as enabling kube-state-metrics, utilizing Prometheus federation, and implementing sharding, are essential for effectively managing large-scale Kubernetes environments. 

Optimized Scraping Configuration: Reducing Overhead Efficiently

Limit Scrape Targets: Reducing the number of scrape targets for each Prometheus instance prevents overloading any single instance. You can assign specific metrics to different instances, spreading the load evenly.

Service Discovery: Implement dynamic service discovery, especially in environments like Kubernetes. This ensures that Prometheus scrape targets are automatically updated when services are added or removed, reducing manual intervention and improving accuracy.

Scraping Intervals: Adjust the scraping intervals and timeouts to suit the criticality of your applications. For high-priority apps, use shorter intervals, while for less critical services, longer intervals can reduce load on the system.
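
Per-job overrides make this concrete. In the sketch below the job names, targets, and interval values are hypothetical; the only hard constraint is that a job's scrape_timeout cannot exceed its scrape_interval.

```yaml
global:
  scrape_interval: 60s     # default for jobs that do not override it
  scrape_timeout: 10s

scrape_configs:
  # High-priority service: scrape more often
  - job_name: 'checkout-api'                     # hypothetical job
    scrape_interval: 15s
    scrape_timeout: 5s
    static_configs:
      - targets: ['checkout-api.prod.svc:8080']  # hypothetical target

  # Less critical service: scrape less often to reduce load
  - job_name: 'batch-reports'                    # hypothetical job
    scrape_interval: 2m
    scrape_timeout: 30s
    static_configs:
      - targets: ['batch-reports.prod.svc:8080'] # hypothetical target
```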

These practices enhance Prometheus’s scalability while maintaining performance. Pairing Prometheus with OpenObserve as a storage backend allows you to offload metric storage and processing, keeping your observability stack efficient and ready for long-term data analysis.

By leveraging these strategies, you can achieve comprehensive visibility, maintain optimal performance, and ensure the reliability of your Kubernetes clusters.

Conclusion

Prometheus plays a critical role in the Kubernetes ecosystem, providing the necessary metrics to ensure the health and performance of your clusters. Advanced monitoring techniques, such as utilizing kube-state-metrics, leveraging PromQL, and implementing scalable monitoring strategies, are essential for effectively managing large-scale Kubernetes environments. Continuous monitoring and proactive alerting are key to maintaining optimal cluster performance.

To enhance your Kubernetes observability further, consider integrating OpenObserve. OpenObserve offers advanced data visualization, real-time analytics, and long-term storage for your metrics. Sign up for our cloud service, explore our GitHub repository, or book a demo to see OpenObserve in action. Enhance your monitoring capabilities with OpenObserve today!

Author:

The OpenObserve Team comprises dedicated professionals committed to revolutionizing system observability through their innovative platform, OpenObserve. The team is dedicated to streamlining data observation and system monitoring, offering high-performance and cost-effective solutions for diverse use cases.

OpenObserve Inc. © 2024