vLLM → OpenObserve

Automatically capture token usage, latency, prompt text, and completion output for every vLLM inference call. Because vLLM serves an OpenAI-compatible API, you instrument it with openinference-instrumentation-openai pointed at your local vLLM server; no changes to the server itself are needed.

Prerequisites

  • Python 3.10+
  • vLLM server running locally
  • An OpenObserve account (cloud or self-hosted)
  • Your OpenObserve organisation ID and Base64-encoded auth token

Installation

pip install openobserve-telemetry-sdk openinference-instrumentation-openai openai python-dotenv

Install and start vLLM with a model of your choice:

pip install vllm
vllm serve facebook/opt-125m --port 8001
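
To confirm the server is up before wiring in telemetry, list its models (a minimal check, assuming the default port 8001 used above):

import json
import urllib.request

# vLLM exposes the standard OpenAI-compatible /v1/models endpoint.
with urllib.request.urlopen("http://localhost:8001/v1/models") as resp:
    models = json.load(resp)

# Prints the model id passed to `vllm serve`, e.g. ['facebook/opt-125m'].
print([m["id"] for m in models["data"]])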

Configuration

Create a .env file in your project root:

OPENOBSERVE_URL=https://api.openobserve.ai/
OPENOBSERVE_ORG=your_org_id
OPENOBSERVE_AUTH_TOKEN=Basic <your_base64_token>
VLLM_BASE_URL=http://localhost:8001/v1
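
To sanity-check that these values are picked up, you can load and validate them with python-dotenv (installed earlier). This is just a sketch; the variable names match the .env file above:

import os
from dotenv import load_dotenv

load_dotenv()

# Fail fast if any of the required settings is missing.
for var in ("OPENOBSERVE_URL", "OPENOBSERVE_ORG", "OPENOBSERVE_AUTH_TOKEN", "VLLM_BASE_URL"):
    assert os.environ.get(var), f"{var} is not set in .env"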

Instrumentation

Call OpenAIInstrumentor().instrument() before creating the OpenAI client. Point the client at your vLLM server using the base_url parameter.

# Load OpenObserve and vLLM settings from .env before anything else reads them.
from dotenv import load_dotenv
load_dotenv()

# Instrument the OpenAI client library; every call it makes becomes a span.
from openinference.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()

# Initialise the OpenObserve exporter and tag spans with a service name.
from openobserve import openobserve_init
openobserve_init(resource_attributes={"service.name": "vllm"})

import os
from openai import OpenAI

# Point the client at the local vLLM server; vLLM ignores the API key.
client = OpenAI(
    api_key="not-needed",
    base_url=os.environ.get("VLLM_BASE_URL", "http://localhost:8001/v1"),
)

# Ask the server which model it is serving rather than hard-coding the name.
models = client.models.list()
model_name = models.data[0].id

response = client.completions.create(
    model=model_name,
    prompt="Explain distributed tracing in one sentence.",
    max_tokens=20,
)
print(response.choices[0].text)

For models with a chat template (e.g. Llama, Mistral, Qwen), use client.chat.completions.create() with a messages list instead.
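
For example, a minimal chat call (reusing the client and model_name from above):

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": "Explain distributed tracing in one sentence."},
    ],
    max_tokens=50,
)
print(response.choices[0].message.content)

Chat requests are typically captured with message-level attributes (following the OpenInference message conventions) rather than llm_prompts_0_prompt_text.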

What Gets Captured

Attribute                       Example Value
operation_name                  Completion
llm_model_name                  facebook/opt-125m
llm_system                      openai
llm_token_count_prompt          7
llm_token_count_completion      20
llm_token_count_total           27
llm_prompts_0_prompt_text       The prompt sent to the model
llm_choices_0_completion_text   The generated response text
llm_invocation_parameters       JSON with model and max_tokens
openinference_span_kind         LLM
span_status                     OK on success, ERROR on failure
duration                        End-to-end request latency in microseconds
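
To produce a span with span_status set to ERROR (useful for testing the filter in step 4 below), request a model the server does not host. A sketch, where the model name is deliberately invalid and the exact exception type depends on the openai SDK version:

from openai import OpenAIError

try:
    client.completions.create(
        model="no-such-model",  # hypothetical name, not served by vLLM
        prompt="This request should fail.",
        max_tokens=5,
    )
except OpenAIError as exc:
    # The failed call is still exported, with span_status = ERROR.
    print(f"Expected failure: {exc}")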

Viewing Traces

  1. Log in to OpenObserve and navigate to Traces
  2. Filter by service_name = vllm to see all inference calls
  3. Click any span to inspect llm_prompts_0_prompt_text, token counts, and llm_choices_0_completion_text
  4. Filter by span_status = ERROR to find failed requests
  5. Sort by duration descending to identify the slowest inference calls

[Screenshot: vLLM traces in OpenObserve]

Next Steps

With vLLM instrumented, every inference call is recorded in OpenObserve. From here you can measure throughput, compare latency across models, and monitor token usage trends over time.
