LLM Evaluations
LLM Evaluations let you automatically score the quality of your LLM traces using an LLM-as-judge pipeline, with results surfaced per trace inside OpenObserve.
Enterprise feature. LLM Evaluations and Eval Templates are available in OpenObserve Enterprise.
Overview
LLM Evaluations build on LLM Observability. Once your application is exporting LLM traces to OpenObserve, an evaluation pipeline can grade each LLM response against criteria such as correctness, relevance, or safety. A second LLM acts as the judge, so you get continuous, automated quality scoring without manual review or a separate evaluation tool.
This is for teams running LLM-powered applications in production who need to monitor not just latency, tokens, and cost, but also the quality of the responses their models produce. Evaluations run server-side as part of a traces pipeline, so they apply to live traffic as it arrives.
At a high level, the feature has three parts: an LLM-evaluation pipeline node that runs the judge on incoming traces, Eval Templates that define what the judge evaluates and how, and an Evaluations view in the trace detail page where you read the results.
How evaluations work
An LLM evaluation runs as a node inside a pipeline attached to a traces stream:
- You add an LLM-evaluation node to a pipeline whose source is a traces stream that contains LLM spans.
- As traces arrive, the node identifies LLM spans, optionally samples them, and sends them to the LLM judge.
- The judge scores each trace using an Eval Template. OpenObserve either uses a template you select by name or auto-resolves one by
response_type. - The evaluation results are written to a separate output stream named
<stream>_evaluations. For example, evaluating a stream calleddefaultproduces adefault_evaluationsstream. - Per-trace results appear in an Evaluations tab in the trace detail view. This tab is shown only for LLM traces that have associated evaluation records.
Only LLM spans are evaluated. The pipeline pre-filters incoming spans and keeps LLM spans, tool spans, and root spans, dropping pure infrastructure spans (such as HTTP, database, or cache spans) that would never be evaluated.
Setting up an evaluation pipeline
Prerequisites
- LLM traces are being ingested into a traces stream. See LLM Observability to instrument your application.
- An OpenObserve Enterprise instance with the LLM judge configured.
- Optionally, an Eval Template created in advance (see Eval Templates). If you do not select one, OpenObserve auto-resolves a template by
response_type.
Add an LLM-evaluation node
- Create or open a pipeline whose source is the traces stream that contains your LLM spans.
- Add an LLM Evaluation node to the pipeline.
- Configure the node parameters described below.
- Save the pipeline. The node begins evaluating traces as new data is ingested.
Node parameters
| Setting | Description | Default |
|---|---|---|
enable_llm_judge |
Turns LLM-as-judge evaluation on for the node. | true |
sampling_rate |
Fraction of LLM traces to evaluate, from 0.0 to 1.0. Use a lower rate to control cost on high-volume streams. A value of 0.0 disables sampling. |
0.01 |
llm_span_identifier |
Span field used to detect LLM spans. Only spans that contain this field with a non-empty value are treated as LLM spans. | llm_input |
eval_template |
Optional template to use for evaluation, selected by name. When set, it overrides auto-resolution by response_type. Leave empty to auto-resolve. |
None |
Tip: Start with a low
sampling_rateon busy production streams to validate your template and judge configuration before scaling up to evaluate every trace.
When you update a pipeline that contains an LLM-evaluation node, OpenObserve refreshes the running evaluation so changes such as a new eval_template take effect on the next batch of ingested traces.
Eval Templates
An Eval Template defines what the LLM judge evaluates and how it scores a response. Templates are managed from the enterprise Eval Templates tab, located alongside Functions and Enrichment Tables.
Template fields
Each template specifies:
response_type: The expected shape of the evaluation result, for example a numeric score in the range0.0to1.0. Theresponse_typeis also used to auto-resolve a template for a pipeline node when no template is explicitly selected.name: A human-readable name used to select the template from a pipeline node.description: An optional summary of what the template evaluates.content: The prompt and instructions the judge uses to perform the evaluation.dimensions: The criteria the trace is evaluated against. At least one dimension is required.
Versioning
Templates are versioned. Editing a template creates a new version and preserves the original creation time, so you can iterate on a template without losing earlier definitions. New templates start at version 1.
Per-template statistics
Each template surfaces usage statistics so you can see how it is performing:
- Total Evaluations: The number of evaluations the template has produced.
- Average Quality Score: The mean quality score across those evaluations, shown as a percentage.
A template that has been created but not yet used reports zero evaluations and a zero average quality score.
Managing templates through the API
Eval Templates can also be managed through a CRUD REST API scoped to an organization:
| Method | Path | Description |
|---|---|---|
GET |
/api/{org_id}/eval_templates |
List all templates for the organization. |
POST |
/api/{org_id}/eval_templates |
Create a new template. |
GET |
/api/{org_id}/eval_templates/{template_id} |
Retrieve a single template. |
PUT |
/api/{org_id}/eval_templates/{template_id} |
Update a template, creating a new version. |
DELETE |
/api/{org_id}/eval_templates/{template_id} |
Delete a template. |
GET |
/api/{org_id}/eval_templates/{template_id}/stats |
Get usage statistics for a template. |
A template that is referenced by a pipeline cannot be deleted. The delete request returns a conflict response listing the pipelines that still use it; update or remove those pipelines first.
Viewing results
Evaluation results are stored in the <stream>_evaluations output stream and surfaced per trace in the UI:
- Navigate to Traces in the left sidebar.
- Open an LLM trace that has been evaluated.
- Select the Evaluations tab in the trace detail view to see the scores and dimensions for that trace.
The Evaluations tab appears only for LLM traces that have associated evaluation records. Traces without evaluations, or non-LLM traces, do not show the tab.
You can also query the <stream>_evaluations stream directly from Logs or build dashboards on it to track quality trends over time.
Best practices
- Confirm LLM traces are flowing and visible under Traces before adding an evaluation node, since the judge only evaluates LLM spans.
- Begin with a low
sampling_rateto validate your template and judge, then increase it once results look correct. - Define clear, focused
dimensionsin each template so scores are easy to interpret and act on. - Use template versioning to refine evaluation prompts over time instead of recreating templates.
- Remove or update pipelines that reference a template before attempting to delete it.
Troubleshooting
The Evaluations tab does not appear on a trace
Problem: You open a trace but there is no Evaluations tab.
Solution:
- Confirm the trace is an LLM trace. Non-LLM traces never show the tab.
- Verify the evaluation pipeline is attached to the correct traces stream and is enabled.
- Check that
sampling_rateis not0.0, since a zero sampling rate means no traces are evaluated. - Confirm the
llm_span_identifiermatches a field present on your LLM spans, so the spans are recognized as LLM spans. - Look for records in the
<stream>_evaluationsoutput stream. If none exist, the pipeline has not produced evaluations yet.
A template cannot be deleted
Problem: Deleting a template returns a conflict error.
Solution:
- Note the pipeline names listed in the error message.
- Update those pipelines to use a different template, or remove their LLM-evaluation nodes.
- Retry the delete once no pipeline references the template.
Note: Evaluation cost scales with the number of traces the judge processes. Use
sampling_rateto balance coverage against the cost of running the judge model. To track the cost of the underlying LLM calls themselves, see Model Pricing.