The Prometheus Cardinality Bomb: How to Prevent It Before It Blows Up


The alert fires at 2:17 AM. Grafana dashboards are blank. Prometheus is OOM-killed and won't stay up. Queries that used to run in milliseconds are now timing out at 30 seconds, or not returning at all. The on-call engineer opens the runbook, finds nothing useful, and starts restarting pods.
Three hours later, after a war room, a rollback, and a lot of coffee, someone finds the root cause: a single line of instrumentation code, merged three months ago by a well-meaning developer who added a user_id label to the main request counter.
No alarms went off at the time. The metric itself looked fine. But every time a new user hit the service, Prometheus silently created a new time series. After 90 days and a million users, that one label had generated over five million time series, and the in-memory TSDB had finally buckled under the weight.
This is the cardinality bomb. It doesn't detonate the moment you pull the pin. It waits.
To understand why this happens, you need to understand how Prometheus stores data.
Prometheus is a time-series database. It doesn't store a single "counter"; it stores one independent time series per unique combination of label values. Every label you attach to a metric multiplies the number of time series Prometheus must track, index, and hold in memory.
Here's the math in plain English:
```
http_requests_total{environment="prod", service="checkout", status_code="200"} → 1 series
http_requests_total{environment="prod", service="checkout", status_code="404"} → 1 series
http_requests_total{environment="prod", service="payments", status_code="200"} → 1 series
...and so on
```
A metric with 3 environments × 5 services × 10 status codes yields 150 time series: entirely manageable.
Now add user_id to that same metric with 1 million unique users:
3 environments × 5 services × 10 status_codes × 1,000,000 user_ids = 150,000,000 series
That's 150 million time series, each one occupying RAM in Prometheus's TSDB. The process doesn't swap to disk gracefully; it OOMs and dies.
This is cardinality: the number of unique time series for a given metric. High cardinality = high memory pressure = instability. And the relationship is roughly linear: double the unique label values, double the memory usage.
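The multiplication is easy to sketch. A minimal Python illustration using the label counts from the example above; the `series_count` helper is made up for this post, not a real library function:

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Total time series = product of unique values across all labels."""
    return prod(label_cardinalities.values())

# Bounded labels only: the fleet described above
bounded = {"environment": 3, "service": 5, "status_code": 10}
print(series_count(bounded))        # 150

# Add one unbounded label and the count explodes
bounded["user_id"] = 1_000_000
print(series_count(bounded))        # 150000000
```

Because the relationship is multiplicative, a single new label doesn't add to the series count; it multiplies every existing combination.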
For a deeper foundation on how Prometheus stores and scrapes this data, see What You Need to Know About Prometheus Architecture.
Not all labels are created equal. Some labels are bounded: their set of possible values is small and stable (a handful of environments, a known list of HTTP status codes). Others are unbounded: new values arrive continuously and there's no ceiling on how many can appear.
Unbounded labels are cardinality bombs. Here are the most common offenders:
| Label | Why It's Dangerous |
|---|---|
| `user_id` | One new series per user. At scale, this is millions of series. |
| `session_id` | Even more volatile: sessions expire, but the time series persist until the retention cutoff. |
| `request_id` / `trace_id` | Unique per request by design. A high-traffic API generates thousands per second. |
| `ip_address` | Unbounded by nature. Especially dangerous in public-facing APIs. |
| `url_path` (raw) | Paths with dynamic segments like `/users/12345/orders` explode into one series per path permutation. |
| `container_id` / `pod_hash` | Container runtimes rotate these constantly. Every new deploy floods Prometheus with fresh series. |
| `error_message` (raw) | Error strings often contain dynamic content (timestamps, IDs, filenames). |
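A common defuse for the raw url_path case is to template dynamic path segments before they ever become label values, so every user's orders page maps to one series. A minimal sketch; the regexes and placeholder names are illustrative assumptions, not a library API:

```python
import re

# Replace numeric IDs and UUID-like segments with fixed placeholders so
# /users/12345/orders and /users/67890/orders collapse into one series.
_UUID = re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")
_ID = re.compile(r"/\d+")

def normalize_path(path: str) -> str:
    """Template out high-cardinality path segments before labeling."""
    path = _UUID.sub("/:uuid", path)
    return _ID.sub("/:id", path)

print(normalize_path("/users/12345/orders"))   # /users/:id/orders
```

Many HTTP frameworks already expose the matched route pattern (for example, `/users/:id/orders`); when available, labeling with that pattern is safer than normalizing raw paths yourself.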
If the number of unique values for a label can grow without a defined ceiling, it is not a label. It is a trace attribute.
A status_code label is fine: HTTP gives you roughly 60 defined codes and you'll realistically see fewer than 15. A user_id label is not fine: it scales with your user base and never stops growing.
When in doubt, ask: "Could this label have 10,000 unique values in production?" If the answer is yes, it belongs in a trace span, not a metric label.
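That question can even be enforced in code. A hypothetical "label allowlist" guard, sketched under the assumption that your instrumentation funnels through one helper; the function and allowlist names are made up for illustration:

```python
# Labels this service treats as bounded; anything else is rejected at the door.
ALLOWED_LABELS = {"environment", "service", "status_code", "method"}

class UnboundedLabelError(ValueError):
    pass

def checked_labels(labels: dict[str, str]) -> dict[str, str]:
    """Gatekeeper called before incrementing any counter."""
    rejected = set(labels) - ALLOWED_LABELS
    if rejected:
        raise UnboundedLabelError(
            f"{sorted(rejected)} are not approved metric labels; "
            "put high-cardinality identifiers on a trace span instead"
        )
    return labels

checked_labels({"service": "checkout", "status_code": "200"})  # passes
```

A guard like this turns a 2 AM outage into a failed unit test at review time.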
For more on Prometheus metric types and when to use counters vs. gauges vs. histograms, see Prometheus Metric Types (Counters, Gauges, Histograms, Summaries).
The core architectural insight behind preventing cardinality explosions is this: metrics and traces serve fundamentally different purposes, and mixing their data models is the source of most cardinality mistakes.
Metrics answer questions about system behavior in aggregate:
- What is the error rate for the checkout service?
- How many 5xx responses came from the prod namespace in the last five minutes?

These questions are answered by bounded, low-cardinality dimensions: a fixed set of services, environments, HTTP status codes, and so on. The power of metrics is that they give you instant, pre-aggregated answers across your entire fleet without scanning raw events.
Traces answer questions about specific request instances:
- Why was this specific request slow?
- What happened to user 12345's checkout?

These questions require high-cardinality identifiers (user IDs, request IDs, session IDs, trace IDs) because you're looking at individual events, not aggregations. Trace backends are designed for exactly this: indexing and retrieving individual spans by arbitrary attribute values.
If you need to answer "what happened to user 12345's request," open a trace. If you need to answer "what is the error rate for the checkout service," query a metric. These are different tools built for different jobs, and conflating them breaks both.
For a practical guide to implementing distributed tracing with OpenTelemetry and sending high-cardinality span attributes to a trace backend, see A Comprehensive Guide to Distributed Tracing: From Basics to Beyond.
To understand how logs, metrics, and traces work together as a unified observability system, see Full-Stack Observability: Connecting Logs, Metrics, and Traces.
If you suspect a cardinality problem is already underway, or want to build a cardinality dashboard before one occurs, here's a step-by-step playbook.
Start by checking the current number of active time series in your TSDB head:
```promql
prometheus_tsdb_head_series
```
A healthy Prometheus instance for a mid-sized production environment typically sits between 100,000 and 2,000,000 series. If you're north of 5 million, you likely have a cardinality problem. If you're north of 10 million, it's urgent.
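Those rough thresholds can be encoded as a triage helper for a dashboard or script. A sketch using the ballpark numbers above; the boundaries are rules of thumb from this post, not official Prometheus limits:

```python
def triage_series_count(head_series: int) -> str:
    """Map a prometheus_tsdb_head_series reading to a rough verdict."""
    if head_series < 2_000_000:
        return "healthy"
    if head_series < 5_000_000:
        return "watch"                      # above the typical mid-size range
    if head_series < 10_000_000:
        return "likely cardinality problem"
    return "urgent"

print(triage_series_count(150_000))       # healthy
print(triage_series_count(12_000_000))    # urgent
```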
Use this query to list every metric name sorted by its series count, descending:
```promql
# Series count per metric name: your cardinality leaderboard
sort_desc(
  count by (__name__) ({__name__=~".+"})
)
```
⚠️ Warning: This query is expensive. Run it during off-peak hours or with a short timeout. On a large instance, it may itself cause performance issues.
For a lighter-weight alternative that targets known problem areas:
```promql
# Count series per label value for a specific metric: useful when you suspect a culprit
count(http_requests_total) by (user_id)
```
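Prometheus also precomputes a top-ten cardinality report at the `/api/v1/status/tsdb` endpoint (the "TSDB Status" page in the UI), which is far cheaper than the leaderboard query. A sketch of parsing the relevant fields, run here against a hardcoded, abbreviated sample payload rather than a live server; field names follow the documented response shape, but verify against your Prometheus version:

```python
import json

# Abbreviated sample of what GET /api/v1/status/tsdb returns
sample = json.loads("""
{
  "status": "success",
  "data": {
    "headStats": {"numSeries": 5200000},
    "seriesCountByMetricName": [
      {"name": "http_requests_total", "value": 4100000},
      {"name": "process_cpu_seconds_total", "value": 1200}
    ]
  }
}
""")

def top_offenders(payload: dict) -> list[tuple[str, int]]:
    """Extract (metric, series count) pairs, already sorted by Prometheus."""
    stats = payload["data"]["seriesCountByMetricName"]
    return [(entry["name"], entry["value"]) for entry in stats]

print(top_offenders(sample)[0])   # ('http_requests_total', 4100000)
```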
Once you've found a high-cardinality metric, identify which label is responsible:
```promql
# How many unique values does each label have for this metric?
count(count by (user_id) (http_requests_total))
count(count by (status_code) (http_requests_total))
count(count by (service_name) (http_requests_total))
```
The label with the largest count is your bomb.
For more on counting unique series and understanding cardinality metrics in Prometheus, see Prometheus Metrics Count Basics.
If you need to preserve some data from a high-cardinality metric while you plan a proper fix, recording rules let you pre-aggregate the expensive metric into a cheaper derived one. Add this to your rules.yml:
```yaml
groups:
  - name: cardinality_control
    interval: 1m
    rules:
      # Aggregate away user_id: keep only the dimensions you actually alert on
      - record: http_requests_total:by_service_and_status
        expr: sum by (service_name, status_code, environment) (http_requests_total)
```
This creates a new metric with manageable cardinality. You can then alert on the recording rule output while you remove the problematic label from your instrumentation.
To drop user_id only from the http_requests_total metric without affecting other metrics, use this pattern in your prometheus.yml:
```yaml
scrape_configs:
  - job_name: "my-service"
    static_configs:
      - targets: ["my-service:8080"]
    metric_relabel_configs:
      # Match only the "bomb" metric. The default semicolon separator joins
      # the source label values as "http_requests_total;<user_id value>".
      - source_labels: [__name__, user_id]
        regex: 'http_requests_total;(.+)'
        target_label: user_id
        replacement: ''   # an empty value removes the label from the series
        action: replace
```
Prometheus treats a label with an empty value as if it were absent, so this single rule is enough. (Note that `labeldrop` cannot be scoped with `source_labels`; it matches label names across every metric in the job, which is the blunter approach below.)
If you know that user_id provides zero value across your entire job and you want it gone from every single metric to save maximum RAM, use this simpler (but more aggressive) rule:
```yaml
metric_relabel_configs:
  - action: labeldrop
    regex: 'user_id'   # removes this label from EVERY metric in this scrape job
```
⚠️ Warning: labeldrop is irreversible at the ingestion point. Once you drop the label here, the data is gone. You cannot "un-drop" it later to see which user caused a specific error. This is why high-cardinality data belongs in traces (OpenTelemetry) or columnar backends (OpenObserve), where it can be stored cheaply.
Add this alert to your Prometheus alerting rules to catch future cardinality growth before it becomes an outage:
```yaml
groups:
  - name: cardinality_alerts
    rules:
      - alert: HighCardinalityMetric
        expr: |
          count by (__name__) ({__name__=~".+"}) > 500000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Metric {{ $labels.__name__ }} has {{ $value }} time series"
          description: >
            A single metric has exceeded 500k time series.
            Investigate label cardinality immediately.
      - alert: PrometheusSeriesCountCritical
        expr: prometheus_tsdb_head_series > 8000000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus TSDB series count is critically high ({{ $value }})"
```
Prometheus's in-memory TSDB model is one of its greatest strengths: it makes PromQL blazing fast for recent data across a bounded set of time series. But it is also its fundamental constraint. Every active time series must fit in RAM. There is no overflow, no columnar spill-to-disk, no dynamic sharding. You either fit, or you OOM (Out of Memory).
For teams that have hit this ceiling, or are designing systems where high-cardinality metrics are unavoidable (SaaS platforms with per-tenant metrics, edge networks with per-node telemetry, platforms tracking thousands of dynamic endpoints), the architectural answer is a backend that doesn't share Prometheus's in-memory constraint.
Prometheus stores each time series as an independent in-memory stream. The moment a new label combination appears, a new stream is allocated. This is optimal for fast range queries over a known, stable set of series, but it makes cardinality a first-class resource problem.
Columnar storage backends (like the one used by OpenObserve) store metric data as columns in compressed files on object storage (S3, GCS, or similar). Rather than allocating a new data structure per unique label combination, data is written in bulk and queried by scanning compressed columns. There is no per-series memory allocation at ingest time.
The practical consequences:
- You can add user_id to a metric and the backend doesn't OOM; it just writes data.
- Query performance degrades gracefully with cardinality rather than catastrophically.

This doesn't mean you should abandon cardinality discipline. Even in columnar backends, high-cardinality queries scan more data and cost more to execute. But it changes cardinality from an availability problem (Prometheus goes down) into a performance and cost trade-off that you can manage deliberately.
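The difference between the two memory models can be caricatured in a few lines of Python. This is a deliberately simplified sketch; neither engine works literally like this, but it shows why one allocates per unique label set and the other doesn't:

```python
# Prometheus-style: one in-memory structure per unique label set.
# The index grows with the number of distinct series.
series_index: dict[tuple, list] = {}

def ingest_per_series(labels: dict, value: float) -> None:
    key = tuple(sorted(labels.items()))
    series_index.setdefault(key, []).append(value)   # new allocation per new key

# Columnar-style: samples appended to flat columns and flushed in batches
# to object storage; no per-series structure is kept at ingest time.
columns = {"labels": [], "value": []}

def ingest_columnar(labels: dict, value: float) -> None:
    columns["labels"].append(labels)
    columns["value"].append(value)

for uid in range(1000):
    ingest_per_series({"user_id": str(uid)}, 1.0)
    ingest_columnar({"user_id": str(uid)}, 1.0)

print(len(series_index))        # 1000 distinct series tracked in the index
print(len(columns["value"]))    # 1000 rows in one flushable batch
```

In the per-series model, every new user_id permanently grows the index; in the columnar model, the batch is written out and ingest-time memory resets.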
For most teams, the path forward is not "replace Prometheus"; it's "use Prometheus for what it's good at, and offload everything else."
Prometheus handles real-time scraping and alerting with its familiar local TSDB. OpenObserve receives the same data via remote_write and handles long-term retention, historical queries, cross-signal correlation (logs + metrics + traces), and any metrics where cardinality makes local storage impractical.
```yaml
# prometheus.yml: add remote_write to OpenObserve
remote_write:
  - url: "https://<your-openobserve-host>/api/<org>/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
    basic_auth:
      username: <openobserve_user>
      password: <openobserve_password>
```
For step-by-step setup of Prometheus remote write to OpenObserve, see the official ingestion documentation.
For a detailed comparison of how Prometheus and OpenObserve handle cardinality in practice, including real cost data, see DataDog Alternative Metrics: OpenObserve PromQL Unlimited.
For Kubernetes-specific metrics collection with OpenTelemetry and Prometheus, see Enhancing Kubernetes Metrics Collection With OpenTelemetry and Prometheus.
Before adding any label to any metric, run through this checklist:
- Is the set of possible values bounded, with a known ceiling?
- Could this label have 10,000 unique values in production?
- Would you ever `sum by (this_label)` in a PromQL query? If not, it probably belongs in a trace span.

A single label choice made at 2 PM on a Tuesday can bring down your metrics stack at 2 AM on a Sunday. The cardinality bomb doesn't make noise when you arm it. The checklist above is how you defuse it before it's too late.