Structured Logging in Production: The Field Guide Nobody Gave You



Every engineering team starts the same way. Someone adds a print statement. Then a logger. Then a hundred loggers. Before long, your production system is emitting thousands of lines like this per minute:
```
ERROR: user 12345 failed payment at 14:23
WARN: retry attempt 2 for order 9981
INFO: request completed in 340ms
```
These logs are perfectly readable to a human, staring at a terminal, debugging a single service, on a quiet afternoon.
That is exactly the wrong context for when you need them.
In production, logs need to answer questions like: which service is erroring right now? Which users were affected? How has the error rate changed over the last hour? Which requests were slow, and why?
An unstructured log line cannot answer any of these. You cannot GROUP BY a sentence. You cannot join a free-text string to a distributed trace. You cannot build an alert on a substring match at scale without burning money on regex filters that break the moment a developer changes their log message wording.
Now look at what the same event looks like as a structured log:
```json
{
  "timestamp": "2026-03-24T14:23:07.412Z",
  "level": "error",
  "service": "payment-service",
  "event": "payment_failed",
  "user_id": "12345",
  "order_id": "9981",
  "error_code": "CARD_DECLINED",
  "amount_usd": 149.99,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req_01HXYZ9ABC"
}
```
Every field is a column. Every column is queryable. Every log line is machine-readable and human-readable. This is where production-grade observability begins.
A good production log has four non-negotiable properties. Most teams implement one or two. The teams that implement all four are the ones who can actually debug incidents in minutes instead of hours.
The format is the foundation. JSON is the default choice: it is natively supported by every major log aggregation platform (OpenObserve, Datadog, Grafana Loki, Elasticsearch) and parseable in every language without custom tooling.
Before:
```
INFO payment processed for user 12345 amount 149.99
```
After:
```json
{
  "level": "info",
  "event": "payment_processed",
  "user_id": "12345",
  "amount_usd": 149.99
}
```
This is the property teams most frequently skip, and the one that makes or breaks cross-service debugging. When your payment service logs userId and your auth service logs user_id and your notification service logs uid, you cannot write a single query to find all logs for a given user.
Define a shared log schema and enforce it. At minimum, standardize these fields across every service:
| Field | Description | Example |
|---|---|---|
| `service` | Originating service name | `"payment-service"` |
| `level` | Severity | `"error"` |
| `event` | Snake-case event name | `"payment_failed"` |
| `timestamp` | ISO 8601 UTC | `"2026-03-24T14:23:07Z"` |
| `trace_id` | OTel trace ID | `"4bf92f3577b34da6a3ce929d0e0e4736"` |
| `user_id` | User identifier (non-PII) | `"12345"` |
| `request_id` | Inbound request ID | `"req_01HXYZ9ABC"` |
Before (payment service): `userId: "12345"`
Before (auth service): `uid: "12345"`
After (both services): `user_id: "12345"`, and your query works everywhere.
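One way to make the shared schema enforceable rather than aspirational is a small internal module that every service imports. The sketch below is illustrative: the module layout, field set, and helper name are assumptions, not part of any standard library.

```python
# Hypothetical shared module, e.g. company_logging/schema.py.
REQUIRED_FIELDS = {"service", "level", "event", "timestamp"}

# Canonical names for identifiers that tend to drift between teams.
CANONICAL = {"userId": "user_id", "uid": "user_id", "requestId": "request_id"}

def validate_log_record(record: dict) -> dict:
    """Rename drifted field names and fail fast on missing required fields."""
    normalized = {CANONICAL.get(key, key): value for key, value in record.items()}
    missing = REQUIRED_FIELDS - normalized.keys()
    if missing:
        raise ValueError(f"log record missing required fields: {sorted(missing)}")
    return normalized

record = validate_log_record({
    "service": "payment-service",
    "level": "error",
    "event": "payment_failed",
    "timestamp": "2026-03-24T14:23:07Z",
    "userId": "12345",  # drifted name, normalized to user_id
})
```

Wiring a helper like this into the logger itself (for example as a structlog processor) catches drift at runtime instead of relying on code review alone.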
Severity levels are only useful if they mean something. When every log line is ERROR, nothing is. Here is the working definition for production:
| Level | Use it when… | Example |
|---|---|---|
| DEBUG | Detailed internal state, useful during development only | Loop iteration values, internal cache hits |
| INFO | Normal operational events worth retaining | Request completed, user logged in, payment processed |
| WARN | Something unexpected happened but the system recovered | Retry succeeded after 1 failure, fallback triggered |
| ERROR | Something failed and requires investigation | Payment failed, DB connection refused, external API 5xx |
| FATAL | The process cannot continue | Missing required config, unrecoverable panic |
The rule: If you wouldn't page someone for it, it is not an ERROR.
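The table translates directly into code. A retry loop is the classic case: a recovered failure is a WARN, and only the final, unrecovered failure is an ERROR. This is an illustrative sketch; `charge` is a hypothetical callable standing in for the real payment call.

```python
import logging

log = logging.getLogger("payment-service")

def charge_with_retries(charge, max_attempts=3):
    """Retry loop illustrating level discipline: recovered failures log
    at WARN; only the final, unrecovered failure logs at ERROR."""
    for attempt in range(1, max_attempts + 1):
        try:
            return charge()
        except ConnectionError as exc:
            if attempt < max_attempts:
                # The system recovered by retrying: WARN, not ERROR.
                log.warning("charge_retry attempt=%d error=%s", attempt, exc)
            else:
                # Out of retries: this requires investigation, so ERROR.
                log.error("charge_failed attempts=%d error=%s", attempt, exc)
                raise
```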
A log line without context is a timestamp and a shrug. The fields that transform a log from noise into signal are the ones that tell you where in the system, for whom, and as part of what flow the event occurred.
Before:
```json
{ "level": "error", "message": "database query failed" }
```
After:
```json
{
  "level": "error",
  "event": "db_query_failed",
  "service": "order-service",
  "table": "orders",
  "operation": "SELECT",
  "duration_ms": 5021,
  "user_id": "12345",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_01HXYZ9ABC",
  "error": "connection timeout after 5000ms"
}
```
The second version tells you the service, operation, how long it waited, which user was affected, and connects directly to the distributed trace. You can debug the second. You can only stare at the first.
If you add exactly one field to every log line today, make it trace_id.
Here is why: in a distributed system, a single user-facing action, say clicking "Buy Now", triggers requests across a chain of services: API gateway → auth → inventory → payment → notification. Each service emits its own logs. Without a shared identifier, those logs are isolated islands. You see an error in the payment service but cannot reconstruct what the auth service saw, what the inventory service returned, or how long each hop took.
trace_id is the thread that stitches them together. It is the same identifier used by OpenTelemetry distributed traces. Once it is in your logs, you can pivot from "here is an error log" to "here is the full trace showing every service, every span, every duration, the exact call stack" in a single click.
Implementation by language:
Python: structlog + OTel
```python
import structlog
from opentelemetry import trace

def add_trace_context(logger, method, event_dict):
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# Usage: trace_id is injected automatically
log.error("payment_failed", user_id="12345", error_code="CARD_DECLINED")
```
Node.js: pino + OTel
```javascript
import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return {
      trace_id: ctx.traceId,
      span_id: ctx.spanId,
    };
  },
});

// Usage: trace_id is mixed in automatically
logger.error({ user_id: "12345", error_code: "CARD_DECLINED" }, "payment_failed");
```
Go: zerolog + OTel
```go
import (
	"context"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
	"go.opentelemetry.io/otel/trace"
)

// logWithTrace starts an error-level event pre-populated with the
// trace context carried in ctx.
func logWithTrace(ctx context.Context) *zerolog.Event {
	sc := trace.SpanFromContext(ctx).SpanContext()
	return log.Error().
		Str("trace_id", sc.TraceID().String()).
		Str("span_id", sc.SpanID().String())
}

// Usage
logWithTrace(ctx).
	Str("user_id", "12345").
	Str("error_code", "CARD_DECLINED").
	Msg("payment_failed")
```
Once trace_id is in your logs, the workflow becomes: find the error log, copy its trace_id, and open the matching trace in your trace viewer. That pivot from log to trace is the difference between "we think we know what happened" and "we know exactly what happened."
Getting structured logging right is as much about what you stop doing as what you start. Here are the five most common mistakes and how to fix them.
The mistake:
```python
logger.debug(f"Processing item {item_id}, current state: {json.dumps(full_state)}")
# × 50,000 items per minute
```
The problem: DEBUG logs in production create two crises simultaneously: you pay for ingestion and storage at scale, and you bury the signal in noise. A system emitting 50,000 DEBUG lines per minute generates 72 million log lines per day. At typical SaaS log pricing, that is a surprisingly large bill for information you will never query.
The fix: Set production log level to INFO by default. Use sampling for DEBUG logs if you need them emit 1% of debug-level lines, not 100%.
```python
# In production config
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")  # Default to INFO, not DEBUG
```
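If you genuinely need some DEBUG visibility in production, sample it instead of turning it off entirely. A sketch using a standard `logging.Filter`; the 1% rate is the configurable knob.

```python
import logging
import random

class DebugSampler(logging.Filter):
    """Pass all records at INFO and above; pass only a fraction of DEBUG records."""

    def __init__(self, sample_rate: float = 0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # INFO/WARN/ERROR always pass
        return random.random() < self.sample_rate  # keep ~1% of DEBUG

logger = logging.getLogger("payment-service")
logger.addFilter(DebugSampler(sample_rate=0.01))
```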
The mistake:
```json
{
  "event": "user_login",
  "email": "jane.doe@example.com",
  "ip_address": "93.184.216.34",
  "credit_card": "4111111111111111"
}
```
The problem: GDPR, CCPA, HIPAA, and PCI-DSS all treat logs as data stores. If you log an email address or a card number, that data is now subject to retention limits, right-to-erasure requests, and breach notification requirements in your log platform, your log archives, and every system the logs flow through.
The fix: Log identifiers, not values. Hash or mask sensitive fields at the logger level, not at the destination.
```json
{
  "event": "user_login",
  "user_id": "12345",
  "ip_prefix": "93.184.x.x",
  "card_last4": "1111"
}
```
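Masking at the logger level can be a single processor that runs before the renderer, so sensitive values never leave the process. This is a hedged sketch; the field list and hashing scheme are illustrative choices, not a compliance recommendation.

```python
import hashlib

# Illustrative list; expand for your own data model.
SENSITIVE = {"email", "credit_card", "ssn"}

def mask_pii(logger, method, event_dict):
    """structlog-style processor: replace sensitive values before rendering."""
    for key in list(event_dict):
        if key in SENSITIVE:
            value = str(event_dict.pop(key))
            # Keep a stable, non-reversible handle so events stay correlatable.
            event_dict[f"{key}_hash"] = hashlib.sha256(value.encode()).hexdigest()[:12]
        elif key == "ip_address":
            octets = str(event_dict.pop(key)).split(".")
            event_dict["ip_prefix"] = ".".join(octets[:2] + ["x", "x"])
    return event_dict
```

Register it early in structlog's processors list (before the JSONRenderer) so every log path passes through it.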
The mistake:
```json
"userId": "12345"
"user_id": "12345"
"uid": "12345"
```
The problem: You cannot query across services for a single user. Every analytics query needs an OR clause covering three field names. Every new engineer has to learn which service uses which convention. Dashboard queries silently miss data from two-thirds of your services.
The fix: Define a canonical field schema as a shared internal package or document, enforce it in code review, and lint for it in CI.
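Linting for banned field names in CI can be as small as a script that fails the build when a non-canonical name appears in source. A rough sketch; the banned list and file glob are assumptions you would adapt to your codebase.

```python
# Hypothetical CI check: fail the build if source uses non-canonical field names.
import pathlib
import re

BANNED = {"userId": "user_id", "uid": "user_id", "requestId": "request_id"}
PATTERN = re.compile(r"\b(" + "|".join(BANNED) + r")\b")

def lint(root: str) -> list:
    """Return one violation message per banned field name found under root."""
    violations = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            for match in PATTERN.finditer(line):
                bad = match.group(1)
                violations.append(f"{path}:{lineno}: use {BANNED[bad]!r}, not {bad!r}")
    return violations
```

Exit non-zero whenever `lint(".")` returns anything and CI blocks the merge.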
The mistake:
```json
{
  "level": "error",
  "message": "NullPointerException at PaymentProcessor.java:147"
}
```
The problem: A stack trace without context is an archaeological find: you know something happened, but not who was affected, what they were doing, or which request caused it. You cannot reproduce it. You cannot correlate it.
The fix: Every error log must include trace_id, request_id, and enough domain context to reproduce the failure.
```json
{
  "level": "error",
  "event": "payment_processing_failed",
  "user_id": "12345",
  "order_id": "9981",
  "payment_method": "credit_card",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_01HXYZ9ABC",
  "error": "NullPointerException",
  "stack_trace": "at PaymentProcessor.java:147..."
}
```
The mistake:
```python
# Counting errors by parsing log messages
logger.info(f"Payment processed successfully. Total today: {counter}")
logger.error(f"Payment failed. Failure count: {failure_counter}")
```
The problem: Logs are optimized for context and correlation, not for counting and aggregation. Deriving metrics from log parsing is expensive (you pay to ingest every log line), fragile (message format changes break dashboards), and delayed (log pipelines introduce latency).
The fix: Use OpenTelemetry metrics for anything you want to count, rate, or histogram. Logs are for events. Metrics are for measurements.
```python
from opentelemetry import metrics

meter = metrics.get_meter("payment-service")
payment_counter = meter.create_counter("payments_processed_total")
failure_counter = meter.create_counter("payment_failures_total")

# In your payment handler:
payment_counter.add(1, {"status": "success", "method": "credit_card"})
# or
failure_counter.add(1, {"error_code": "CARD_DECLINED", "method": "credit_card"})
```
Your dashboards now query a metrics store (fast, cheap, designed for aggregation) instead of parsing log text (slow, expensive, fragile).
OpenTelemetry's log signal is the cleanest way to get structured logs, trace_id injection, and OTLP export working together without building any of it yourself.
Here is the full setup for Python as a reference implementation. The pattern is identical in Node.js, Go, and Java.
```shell
pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-logging \
  structlog
```
```python
# otel_setup.py
import logging

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def setup_otel(service_name: str, otlp_endpoint: str):
    resource = Resource.create({"service.name": service_name})

    # Trace provider
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(tracer_provider)

    # Log provider: automatically injects trace_id + span_id
    logger_provider = LoggerProvider(resource=resource)
    logger_provider.add_log_record_processor(
        BatchLogRecordProcessor(OTLPLogExporter(endpoint=otlp_endpoint))
    )

    # Attach to Python's standard logging
    handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
    logging.getLogger().addHandler(handler)

    return tracer_provider, logger_provider
```
```python
# logging_setup.py
import logging

import structlog

def configure_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        logger_factory=structlog.stdlib.LoggerFactory(),
    )
```
```python
# main.py
import structlog

from otel_setup import setup_otel
from logging_setup import configure_logging

setup_otel(
    service_name="payment-service",
    otlp_endpoint="https://your-openobserve-instance.com:4317"
)
configure_logging()

log = structlog.get_logger()

def process_payment(user_id: str, amount: float):
    # structlog + the OTel SDK inject trace_id and span_id automatically
    log.info(
        "payment_processing_started",
        user_id=user_id,
        amount_usd=amount,
        service="payment-service",
    )
```
Every log line shipped to OpenObserve (or any OTLP-compatible backend) now looks like this, with trace_id and span_id injected automatically by the OTel SDK and zero manual work per log call:
```json
{
  "timestamp": "2026-03-24T14:23:07.412Z",
  "level": "info",
  "service": "payment-service",
  "event": "payment_processing_started",
  "user_id": "12345",
  "amount_usd": 149.99,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
```
From this point, every log is automatically linked to its distributed trace. Clicking trace_id in the log viewer opens the full trace. Clicking a span in the trace shows the correlated logs. The feedback loop that used to take hours of log archaeology now takes seconds.
This is where structured logging pays its dividend.
Once your logs are in OpenObserve, every field is a column in a queryable table. The same log data that used to require `grep | awk | sed` pipelines can now be analyzed with standard SQL in real time, across billions of log lines.
```sql
SELECT
  service,
  COUNT(*) AS error_count
FROM logs
WHERE
  level = 'error'
GROUP BY service
ORDER BY error_count DESC;
```
This query is impossible with unstructured logs. With structured logs and OpenObserve, it runs in seconds.
```sql
SELECT
  timestamp,
  service,
  event,
  level,
  error,
  trace_id
FROM logs
WHERE
  user_id = '12345'
ORDER BY timestamp ASC;
```
You now have the complete chronological audit trail for user 12345 across every service in one result set.
```sql
SELECT
  DATE_TRUNC('minute', timestamp) AS minute,
  service,
  COUNT(*) FILTER (WHERE level = 'error') AS errors,
  COUNT(*) AS total,
  ROUND(
    COUNT(*) FILTER (WHERE level = 'error') * 100.0 / COUNT(*),
    2
  ) AS error_rate_pct
FROM logs
GROUP BY minute, service
ORDER BY minute ASC;
```
Plot this query on a dashboard and you have a real-time error rate chart per service: the kind of chart that makes the difference between catching a degradation in 2 minutes and discovering it from a customer tweet.
```sql
SELECT
  trace_id,
  request_id,
  user_id,
  duration_ms,
  event,
  timestamp
FROM logs
WHERE
  service = 'payment-service'
  AND duration_ms IS NOT NULL
  AND timestamp >= NOW() - INTERVAL '6 hours'
ORDER BY duration_ms DESC
LIMIT 20;
```
Copy any trace_id from the results and open it directly in the trace viewer to understand why those requests were slow.
These queries represent a category of observability work that simply does not exist with unstructured logs. The logs become a first-class data source, not a last resort.
Structured logging is not a nice-to-have for distributed systems; it is the prerequisite for every other observability practice. Without it, you cannot correlate across services, you cannot query at scale, and you cannot pivot from an alert to a root cause without spending 45 minutes in a terminal.
The implementation path is straightforward: emit JSON, standardize a shared field schema, apply log levels with discipline, enrich every line with context, and add trace_id to every log via the OTel SDK. The teams that get this right do not just debug faster. They stop flying blind.
Want to try SQL log analytics on your own structured logs? OpenObserve offers a free tier with full SQL query support, built-in distributed trace correlation, and OTLP ingestion, no credit card required.
Structured logging is the practice of emitting log records as machine-readable key-value pairs (typically JSON) rather than free-text strings. Each field in a structured log is independently queryable, enabling aggregation, filtering, and correlation across services at scale.
Unstructured logs are free-text strings like "ERROR: payment failed for user 12345". They are human-readable but cannot be queried programmatically without regex parsing. Structured logs emit the same information as discrete fields (level, event, user_id) that any SQL or log analytics engine can filter, group, and aggregate natively.
Every production log should include at minimum: timestamp (ISO 8601 UTC), level (severity), service (originating service), event (snake-case event name), trace_id (OTel trace ID), and request_id. Domain-specific fields like user_id and order_id should be added wherever relevant.
A trace_id is a 128-bit identifier, standardized by OpenTelemetry, that is shared across every service involved in processing a single user-facing request. Adding trace_id to log lines allows you to find all logs produced by a single request across dozens of services with a single query, and to pivot directly from a log entry to the distributed trace that caused it.
The five most common mistakes are: (1) enabling DEBUG logging in production, which creates cost and noise; (2) logging PII in plain text, which creates compliance exposure; (3) using inconsistent field names across services, which breaks cross-service queries; (4) logging errors without context like trace_id or request_id; and (5) using log lines as a substitute for metrics, which is expensive and fragile.
OpenTelemetry's log SDK automatically injects trace_id and span_id into every log record produced while a trace is active. This means logs are correlated to distributed traces with zero per-log-call effort, and the logs can be exported via OTLP to any compatible backend alongside traces and metrics, giving you a unified observability signal from a single SDK.
Yes. Log platforms like OpenObserve expose structured log fields as SQL-queryable columns. This enables analytics that are impossible with unstructured logs: counting errors by service, computing error rates over time, finding the slowest requests, or retrieving the full log history for a specific user, all with standard SQL.
Use structlog for structured log formatting and the opentelemetry-instrumentation-logging package to inject trace_id automatically. Configure structlog with JSONRenderer as the final processor, attach the OTel LoggingHandler to Python's root logger, and export via OTLP to your log backend.
Structured logging itself has negligible overhead. The cost driver in log management is volume, not format. Structured logging typically reduces costs compared to unstructured logging because it enables you to set meaningful severity filters (eliminating DEBUG logs in production), use sampling intelligently, and route logs to the right storage tier, because every log has queryable metadata.