Promptfoo → OpenObserve
Capture evaluation run spans, per-case pass/fail results, and token usage for every Promptfoo eval. Promptfoo is an LLM evaluation framework for testing prompt quality and model behaviour. Instrument it by wrapping each eval case in a manual OTel span, with the OpenAI instrumentor tracing the underlying model calls.
Prerequisites
- Python 3.8+
- An OpenObserve account (cloud or self-hosted)
- Your OpenObserve organisation ID and Base64-encoded auth token
- An OpenAI API key
Installation
pip install openobserve-telemetry-sdk openinference-instrumentation-openai openai opentelemetry-api python-dotenv
Configuration
Create a .env file in your project root:
OPENOBSERVE_URL=https://api.openobserve.ai/
OPENOBSERVE_ORG=your_org_id
OPENOBSERVE_AUTH_TOKEN=Basic <your_base64_token>
OPENAI_API_KEY=your-openai-api-key
Instrumentation
Call OpenAIInstrumentor().instrument() and openobserve_init() before running evaluations. Wrap each eval case in a promptfoo.eval_case span recording the prompt, output, and pass/fail result.
from dotenv import load_dotenv

load_dotenv()

# Instrument OpenAI before the client is created so model calls are traced.
from openinference.instrumentation.openai import OpenAIInstrumentor
from openobserve import openobserve_init

OpenAIInstrumentor().instrument()
openobserve_init()

import os

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Each case pairs a prompt with a keyword the response must contain.
eval_cases = [
    {"prompt": "What is distributed tracing?", "expected_keyword": "trace"},
    {"prompt": "Explain OpenTelemetry in one sentence.", "expected_keyword": "telemetry"},
    {"prompt": "What is a span?", "expected_keyword": "span"},
]

for case in eval_cases:
    with tracer.start_as_current_span("promptfoo.eval_case") as span:
        span.set_attribute("promptfoo.prompt", case["prompt"])
        span.set_attribute("promptfoo.expected_keyword", case["expected_keyword"])

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": case["prompt"]}],
            max_tokens=100,
        )
        output = response.choices[0].message.content

        # Keyword assertion: the case passes if the expected keyword
        # appears anywhere in the model's output.
        passed = case["expected_keyword"].lower() in output.lower()

        span.set_attribute("promptfoo.output", output[:300])
        span.set_attribute("promptfoo.pass", passed)
        span.set_attribute("promptfoo.prompt_tokens", response.usage.prompt_tokens)
        span.set_attribute("promptfoo.completion_tokens", response.usage.completion_tokens)
        span.set_attribute("span_status", "OK")

        print(f"{'PASS' if passed else 'FAIL'}: {output[:60]}...")
What Gets Captured
| Attribute | Description |
|---|---|
| promptfoo.prompt | The eval prompt |
| promptfoo.expected_keyword | The keyword asserted against the response |
| promptfoo.output | The model's response (truncated to 300 characters) |
| promptfoo.pass | true if the assertion passed, false otherwise |
| promptfoo.prompt_tokens | Prompt tokens consumed |
| promptfoo.completion_tokens | Completion tokens consumed |
| duration | Per-case eval latency |
| span_status | OK or error status |
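The example always records the span_status attribute as OK, even for failing cases. If you would rather surface failed assertions through the span's native OTel status, so they show up in OpenObserve's error-based filters, the standard status API can be used inside the per-case span; a sketch:

from opentelemetry.trace import Status, StatusCode

# Inside the promptfoo.eval_case span, after computing `passed`:
if passed:
    span.set_status(Status(StatusCode.OK))
else:
    # Marks the span as errored so it appears in error-based trace filters.
    span.set_status(Status(StatusCode.ERROR, "keyword assertion failed"))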
Viewing Traces
- Log in to OpenObserve and navigate to Traces
- Filter by span name promptfoo.eval_case to see all eval cases
- Filter by promptfoo.pass = false to find failing cases
- Sort by duration to identify slow eval cases
Next Steps
With Promptfoo eval cases in OpenObserve, you can track pass rates over time, alert on regressions when the promptfoo.pass rate drops, and correlate eval failures with specific prompts or models.
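If your pipeline also exports OTel metrics, emitting a counter per result makes pass-rate dashboards and alerts straightforward. A sketch using the standard OTel metrics API; whether openobserve_init configures a metrics exporter is an assumption to verify for your setup:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
eval_counter = meter.create_counter(
    "promptfoo.eval_cases",
    description="Promptfoo eval cases, labelled by result",
)

# Inside the eval loop, after computing `passed`:
eval_counter.add(1, {"result": "pass" if passed else "fail"})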