# Ollama → OpenObserve
Automatically capture token usage, latency, and model metadata for every Ollama inference call in your Python application — no cloud API key required.
## Prerequisites
- Python 3.8+
- Ollama running locally (default: `http://localhost:11434`)
- An OpenObserve account (cloud or self-hosted)
- Your OpenObserve organisation ID and Base64-encoded auth token
Pull a model before running the examples:
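For example, to fetch `llama3.2`, the model used throughout this guide:

```shell
ollama pull llama3.2
```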
## Installation
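A minimal install sketch, assuming the Ollama instrumentation ships on PyPI as `opentelemetry-instrumentation-ollama` and the OpenObserve helper as `openobserve` (check the OpenObserve docs for the exact package names):

```shell
pip install opentelemetry-instrumentation-ollama openobserve
```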
## Configuration
Create a `.env` file in your project root:

```ini
# OpenObserve instance URL
# Default for self-hosted: http://localhost:5080
OPENOBSERVE_URL=https://api.openobserve.ai/

# Your OpenObserve organisation slug or ID
OPENOBSERVE_ORG=your_org_id

# Basic auth token — Base64-encoded "email:password"
OPENOBSERVE_AUTH_TOKEN="Basic <your_base64_token>"

# Ollama base URL (change if Ollama is running on a different host)
OLLAMA_HOST=http://localhost:11434
```
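The sketch below shows how these settings are picked up at runtime; it assumes `openobserve_init()` reads them from the process environment, so export them or load the `.env` file first (for example with `python-dotenv`):

```python
import os

# Fall back to the self-hosted defaults if the .env file was not loaded.
url = os.getenv("OPENOBSERVE_URL", "http://localhost:5080")
org = os.getenv("OPENOBSERVE_ORG", "default")
print(f"Exporting telemetry to {url} (org: {org})")
```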
## Instrumentation
Call `OllamaInstrumentor().instrument()` before any Ollama client is created.
```python
from opentelemetry.instrumentation.ollama import OllamaInstrumentor
from openobserve import openobserve_init

# Instrument before importing the Ollama client
OllamaInstrumentor().instrument()
openobserve_init()

import ollama

# Chat completion
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain distributed tracing in one sentence."}],
)
print(response["message"]["content"])
```
### Streaming

```python
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about observability."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```
## Using the OpenAI-compatible endpoint
If you use Ollama's OpenAI-compatible API (`/v1/chat/completions`), instrument it with the OpenAI instrumentor instead:
```python
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openobserve import openobserve_init
from openai import OpenAI

OpenAIInstrumentor().instrument()
openobserve_init()

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
## What Gets Captured
| Attribute | Description |
|---|---|
| `gen_ai_request_model` | Model name (e.g. `llama3.2`) |
| `gen_ai_usage_input_tokens` | Tokens in the prompt |
| `gen_ai_usage_output_tokens` | Tokens in the response |
| `llm_usage_tokens_total` | Total tokens consumed |
| `llm_usage_cost_input` | Estimated input cost in USD |
| `llm_usage_cost_output` | Estimated output cost in USD |
| `gen_ai_system` | `ollama` |
| `duration` | End-to-end request latency |
| `error` | Exception details if the request failed |
## Viewing Traces
- Log in to OpenObserve and navigate to **Traces**
- Click any span to inspect token counts and full request metadata
- Use `gen_ai_request_model` to compare latency across different locally hosted models
## Next Steps
With Ollama instrumented, every local inference call is automatically recorded in OpenObserve, with no cloud API key required. From here you can compare token throughput across models, monitor latency for different prompt sizes, and benchmark locally hosted models side by side.