LLM Cost Monitoring with OpenObserve: Track Token Usage, Control AI Spend, and Visualize Every Dollar Across Your LLM Pipelines


Try OpenObserve Cloud today for more efficient and performant observability.

It usually starts the same way. You ship an AI feature, usage picks up, and three weeks later someone forwards a provider billing email with a number that is two or three times what the team estimated. Nobody knows which feature caused it. Nobody knows which model. Nobody knows which user, which prompt, or which pipeline stage.
This is the LLM cost visibility gap, and it is endemic to teams building on LLM APIs today.
The root cause is structural. LLM provider billing dashboards give you monthly aggregates by API key. They tell you how much you spent. They do not tell you why you spent it. That requires LLM observability instrumented inside your own application, shipping structured telemetry into a platform that can query it at the span level.
OpenObserve closes this gap entirely. With the right instrumentation and dashboard configuration, you can answer questions like:

- Which model is driving the largest share of spend?
- Which feature or pipeline stage generates the most expensive calls?
- Which users sit at the top of the cost distribution?
- Which individual spans were the costliest, and why?

The OpenObserve LLM Cost Monitoring dashboard shown in this guide answers all of these questions out of the box.
LLM cost monitoring is the practice of continuously measuring, attributing, and optimizing the financial costs generated by large language model API calls in production. It is a subdiscipline of LLM observability, the broader practice of understanding LLM system behavior through structured telemetry.
At its core, LLM cost monitoring requires capturing five things on every LLM API call:
| Signal | Field Name | Why It Matters |
|---|---|---|
| Input tokens | `llm_usage_tokens_input` | Primary cost driver; grows with system prompt size, RAG context, conversation history |
| Output tokens | `llm_usage_tokens_output` | Often 5–10× more expensive per token than input; unbounded without `max_tokens` |
| Total tokens | `llm_usage_tokens_total` | Context window utilization; rate-limiting surface |
| Model identifier | `gen_ai_request_model` | Critical for attribution; tier differences can mean 10–100× cost difference |
| USD cost | `llm_usage_cost_total`, `llm_usage_cost_input`, `llm_usage_cost_output` | Must be computed at instrumentation time, not inferred from billing exports |
These five signals, when combined with application context (operation name, user ID, span ID), form the foundation of a complete LLM cost observability solution.
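Computing cost at instrumentation time is straightforward once you keep a per-model price table next to your client code. A minimal sketch in Python; the model names and per-million-token prices below are illustrative placeholders, not current provider rates:

```python
# Illustrative per-model prices in USD per 1M tokens -- placeholder values,
# not real provider rates; keep this table in sync with your provider's pricing page.
PRICES_PER_1M = {
    "gpt-4o-2024-11-20": {"input": 2.50, "output": 10.00},
    "small-model":       {"input": 0.15, "output": 0.60},
}

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Return the token and cost fields to attach to the LLM span."""
    p = PRICES_PER_1M[model]
    cost_in = input_tokens / 1_000_000 * p["input"]
    cost_out = output_tokens / 1_000_000 * p["output"]
    return {
        "llm_usage_tokens_input": input_tokens,
        "llm_usage_tokens_output": output_tokens,
        "llm_usage_tokens_total": input_tokens + output_tokens,
        "llm_usage_cost_input": round(cost_in, 6),
        "llm_usage_cost_output": round(cost_out, 6),
        "llm_usage_cost_total": round(cost_in + cost_out, 6),
    }
```

Attaching these fields to the span at call time is what makes span-level SQL attribution possible later; billing exports cannot be joined back to individual calls.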
OpenObserve is an open-source, Rust-based observability platform that stores logs, metrics, and traces in columnar Parquet on object storage. It’s a cost-efficient alternative to traditional stacks and commercial tools.
For LLM cost monitoring, it offers four key advantages:

- Native SQL: run SUM, AVG, GROUP BY, and percentiles directly without learning a new query language
- VRL functions: parse raw JSON fields such as `llm_input` directly in dashboards
- `APPROX_PERCENTILE_CONT` to track P50/P75/P99 cost and identify expensive outliers
- `spath()` to pull nested fields (e.g., tool names) from spans without pre-processing

OpenObserve stream names are user-defined at ingestion time. Whether your stream is called `llm_traces`, `ai_agent_spans`, `openai_telemetry`, or `prod_llm_calls` makes no functional difference. All SQL queries in this guide use the placeholder `<your_llm_trace_stream>`; substitute your actual stream name throughout.
Your LLM trace stream should contain these fields for the dashboard to function correctly:
- `_timestamp`: event time (auto-indexed by OpenObserve)
- `span_id`: unique span identifier
- `operation_name`: span/function name (e.g. `"llm.chat.completion"`)
- `gen_ai_request_model`: model identifier (e.g. `"gpt-4o-2024-11-20"`)
- `gen_ai_tool_name`: tool name for tool-use spans (e.g. `"tools_call"`)
- `llm_usage_tokens_input`: input token count
- `llm_usage_tokens_output`: output token count
- `llm_usage_tokens_total`: total token count
- `llm_usage_cost_input`: input cost in USD
- `llm_usage_cost_output`: output cost in USD
- `llm_usage_cost_total`: total cost in USD
- `llm_input`: raw JSON of the messages array (for VRL extraction)
- `user_id`: user identifier for per-user cost attribution
- `span_status`: span status (`"OK"`, `"ERROR"`)
- `status_code`: numeric status code (0 = OK)
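For concreteness, a single ingested span record with these fields might look like the following. All values are made up for illustration:

```python
# One illustrative trace record matching the schema above; every value is invented.
sample_span = {
    "_timestamp": 1735689600000000,  # event time in microseconds since epoch
    "span_id": "a1b2c3d4e5f60718",
    "operation_name": "llm.chat.completion",
    "gen_ai_request_model": "gpt-4o-2024-11-20",
    "gen_ai_tool_name": "tools_call",
    "llm_usage_tokens_input": 1850,
    "llm_usage_tokens_output": 320,
    "llm_usage_tokens_total": 2170,
    "llm_usage_cost_input": 0.004625,
    "llm_usage_cost_output": 0.0032,
    "llm_usage_cost_total": 0.007825,
    "llm_input": '{"messages": [{"role": "user", "content": "..."}]}',
    "user_id": "user_8421",
    "span_status": "OK",
    "status_code": 0,
}
```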
These field names follow the OpenTelemetry GenAI semantic conventions with underscores replacing dots: OpenObserve normalizes OTel attribute dots to underscores at ingest time, so `gen_ai.request.model` in your OTel span becomes `gen_ai_request_model` in OpenObserve SQL.
Reference: OpenTelemetry for LLM spans
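The normalization is mechanical: every dot in an OTel attribute key becomes an underscore. A trivial helper, useful when translating semantic-convention names into the column names you query:

```python
def to_openobserve_field(otel_attribute: str) -> str:
    """Map an OTel attribute key to the column name OpenObserve exposes in SQL."""
    return otel_attribute.replace(".", "_")

# e.g. the GenAI semantic-convention attribute for the requested model:
print(to_openobserve_field("gen_ai.request.model"))  # gen_ai_request_model
```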
A production-grade LLM cost monitoring dashboard in OpenObserve typically has two functional areas: a cost overview and an agent/tool monitoring view. Here is what each panel is for and what questions it answers.
Instrument your LLM application to get data into OpenObserve. After that, you can make use of the prebuilt LLM Cost Monitoring Dashboard.

Explore: Dashboards in OpenObserve
Effective LLM cost attribution transforms raw spend numbers into business-actionable decisions. The three core attribution dimensions each serve a different purpose:
When your cost-by-model chart shows spend concentrated in your most expensive tier, the question becomes: does it need to be? Model right-sizing, that is, using the cheapest model that meets quality requirements for each task, is typically the highest-ROI LLM cost optimization available.
Mapping spend to operation names (your span/function names) gives engineering and product teams a shared language for cost management. Once you know what each feature costs per call, you can set per-feature budgets, track against them each sprint, and prioritize prompt engineering work where it delivers the most value.
Per-user cost data enables pricing tier calibration (if P99 users cost 100× median users, flat-rate pricing may not be sustainable), per-user quota enforcement, and abuse detection: users with sudden cost spikes from prompt injection attempts or automated querying are immediately visible.
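Each attribution dimension is a single GROUP BY against the trace stream. A sketch of the per-model spend query, using the `<your_llm_trace_stream>` placeholder convention from earlier; swap the grouping column for `operation_name` or `user_id` to get the other two dimensions:

```python
# Sketch of an attribution query; <your_llm_trace_stream> is the placeholder
# used throughout this guide -- substitute your real stream name.
cost_by_model_sql = """
SELECT
  gen_ai_request_model AS model,
  SUM(llm_usage_cost_total) AS total_cost_usd,
  COUNT(*) AS calls
FROM "<your_llm_trace_stream>"
GROUP BY gen_ai_request_model
ORDER BY total_cost_usd DESC
"""
```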
LLM cost distributions are right-skewed: most calls are cheap, but a small fraction is extremely expensive. The arithmetic mean is almost always misleading because it is pulled upward by outlier spans; percentiles tell the true story.
OpenObserve supports APPROX_PERCENTILE_CONT natively in SQL, enabling P50/P75/P99 cost panels without pre-aggregation.
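A sketch of a percentile panel query under the same stream-name assumption; `APPROX_PERCENTILE_CONT` takes the column and a fraction between 0 and 1:

```python
# Sketch of a P50/P75/P99 cost panel query; replace the stream placeholder.
percentile_cost_sql = """
SELECT
  APPROX_PERCENTILE_CONT(llm_usage_cost_total, 0.50) AS p50_cost,
  APPROX_PERCENTILE_CONT(llm_usage_cost_total, 0.75) AS p75_cost,
  APPROX_PERCENTILE_CONT(llm_usage_cost_total, 0.99) AS p99_cost
FROM "<your_llm_trace_stream>"
"""
```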
How to interpret the spread:
| P99/P50 Ratio | Interpretation | Recommended Action |
|---|---|---|
| < 10× | Healthy, tight distribution | Monitor; no action required |
| 10–25× | Natural variance | Set max_tokens per operation at p95 + 20% |
| 25–50× | Concerning tail risk | Investigate top P99 spans in Costliest Spans table |
| > 50× | Critical | max_tokens is almost certainly unconstrained |
The P99/P50 ratio is the single most actionable cost signal in the Tool Monitoring tab. A ratio above 50× almost always resolves by setting appropriate max_tokens values per operation, typically reducing total output token cost by 15–40%.
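The interpretation table above can be encoded directly as a triage helper. A minimal sketch; the function name and return strings are illustrative:

```python
def classify_tail_risk(p99_cost: float, p50_cost: float) -> str:
    """Classify the P99/P50 cost ratio per the interpretation table above."""
    ratio = p99_cost / p50_cost
    if ratio < 10:
        return "healthy: monitor, no action required"
    if ratio < 25:
        return "natural variance: set max_tokens per operation at p95 + 20%"
    if ratio < 50:
        return "concerning: investigate top P99 spans in the Costliest Spans table"
    return "critical: max_tokens is almost certainly unconstrained"
```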
The output-to-input token ratio (output_tokens / input_tokens) is a second-order diagnostic signal that most teams ignore, yet it is often the earliest warning of a cost problem.
Why it matters: most cost charts show you that costs went up. The token ratio shows you how: was the increase driven by growing inputs, growing outputs, or both? This narrows the diagnosis before the cost impact compounds.
How to interpret changes: a rising ratio usually means `max_tokens` was loosened, or a model switch introduced more verbose completions.

Adding operation_name as a grouping dimension transforms the aggregate ratio chart into a per-feature efficiency monitor. Different operation types have naturally different healthy ratios: classification should produce very short outputs relative to input, while code generation may produce outputs longer than the prompt. A ratio that moves outside its historical range for a specific operation is a regression worth investigating.
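The per-operation ratio chart can be driven by one grouped query. A sketch, assuming the stream placeholder and OpenObserve's `histogram(_timestamp)` time-bucketing function for dashboard series:

```python
# Sketch of a per-operation output/input token ratio series;
# replace the stream placeholder with your real stream name.
token_ratio_sql = """
SELECT
  histogram(_timestamp) AS ts,
  operation_name,
  SUM(llm_usage_tokens_output) * 1.0 / SUM(llm_usage_tokens_input) AS output_input_ratio
FROM "<your_llm_trace_stream>"
GROUP BY ts, operation_name
ORDER BY ts
"""
```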
Dashboards tell you what happened. Alerts tell you what is happening. A complete LLM spend alerting setup covers four categories:
**Spend spike detection.** Compare current-period cost against a rolling baseline (e.g., same hour over the past 7 days). Fire when the ratio exceeds 2×. Cadence: every 5 minutes. This is the primary signal for detecting deployment-correlated cost regressions before they accumulate.

**Unauthorized model detection.** Maintain an allowlist of approved production model identifiers. Alert immediately when any model outside the allowlist appears in the trace stream. Cadence: every 1 minute. Catches accidental deployments of preview or experimental model versions that may be 10–50× more expensive.

**High-cost single span.** Alert when any individual LLM span exceeds a cost threshold calibrated to your application's normal range. Cadence: every 2 minutes. Surfaces runaway generations, prompt injection attempts, and misconfigured max_tokens on specific operations.

**Tool failure rate spike.** Alert when the tool call failure rate in agentic workloads exceeds 20% over a rolling window. Cadence: every 5 minutes. Sustained tool failures amplify LLM costs through retries and cascade to context accumulation in subsequent calls.
All four alert types are SQL queries running on your trace stream in OpenObserve's scheduled alert engine, dispatching to Slack, PagerDuty, email, or any HTTP webhook.
Configuration guide: OpenObserve Alerts documentation
LLM costs don't spiral out of control because teams are careless; they spiral because the signals needed to catch regressions early are invisible by default. Provider billing exports are delayed, aggregated, and stripped of the business context that makes cost data actionable. By the time a cost problem shows up on an invoice, it has often been compounding for days.
OpenObserve changes that equation. By treating LLM API calls as first-class telemetry events, capturing token counts, computed costs, model identifiers, and business context on every span, you get a real-time, queryable picture of exactly where your AI spend is going and why. The same platform that monitors your infrastructure can now tell you which feature, which user, and which prompt is driving your LLM budget.
Get started: OpenObserve Quickstart