Structured Logging in Production: The Field Guide Nobody Gave You

Simran Kumari
March 24, 2026
16 min read

The Problem with Unstructured Logs

Every engineering team starts the same way. Someone adds a print statement. Then a logger. Then a hundred loggers. Before long, your production system is emitting thousands of lines like this per minute:

ERROR: user 12345 failed payment at 14:23
WARN:  retry attempt 2 for order 9981
INFO:  request completed in 340ms

These logs are perfectly readable to a human staring at a terminal, debugging a single service on a quiet afternoon.

That is exactly the wrong context for when you need them.

In production, logs need to answer questions like:

  • How many payment failures did we have in the last hour, broken down by error code?
  • What was the full request path for user 12345's failed transaction across all five services?
  • Is the error rate for the checkout service trending up or down over the last 30 minutes?

An unstructured log line cannot answer any of these. You cannot GROUP BY a sentence. You cannot join a free-text string to a distributed trace. You cannot build an alert on a substring match at scale without burning money on regex filters that break the moment a developer changes their log message wording.

Now look at what the same event looks like as a structured log:

{
  "timestamp": "2026-03-24T14:23:07.412Z",
  "level": "error",
  "service": "payment-service",
  "event": "payment_failed",
  "user_id": "12345",
  "order_id": "9981",
  "error_code": "CARD_DECLINED",
  "amount_usd": 149.99,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req_01HXYZ9ABC"
}

Every field is a column. Every column is queryable. Every log line is machine-readable and human-readable. This is where production-grade observability begins.

What Makes a Log "Good"?

A good production log has four non-negotiable properties. Most teams implement one or two. The teams that implement all four are the ones who can actually debug incidents in minutes instead of hours.

Property 1: Structured Format (JSON, Not Free-Text)

The format is the foundation. JSON is the default choice: it is natively supported by every major log aggregation platform (OpenObserve, Datadog, Grafana Loki, Elasticsearch) and parseable by every language without custom tooling.

Before:

INFO payment processed for user 12345 amount 149.99

After:

{
  "level": "info",
  "event": "payment_processed",
  "user_id": "12345",
  "amount_usd": 149.99
}

Property 2: Consistent Field Names Across Services

This is the property teams most frequently skip, and the one that makes or breaks cross-service debugging. When your payment service logs userId and your auth service logs user_id and your notification service logs uid, you cannot write a single query to find all logs for a given user.

Define a shared log schema and enforce it. At minimum, standardize these fields across every service:

Field        Description                  Example
service      Originating service name     "payment-service"
level        Severity                     "error"
event        Snake-case event name        "payment_failed"
timestamp    ISO 8601 UTC                 "2026-03-24T14:23:07Z"
trace_id     OTel trace ID                "4bf92f3577b34da6a3ce929d0e0e4736"
user_id      User identifier (non-PII)    "12345"
request_id   Inbound request ID           "req_01HXYZ9ABC"

Before (payment service): userId: "12345"
Before (auth service): uid: "12345"
After (both services): user_id: "12345", and your query works everywhere.
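Enforcement is easiest when normalization lives in one shared helper that every service imports. Here is a minimal stdlib-only sketch; the emit helper and the CANONICAL alias map are illustrative names, not part of any library:

```python
import json
import time

# Legacy aliases mapped to the canonical schema fields above (hypothetical
# examples; extend with whatever spellings exist in your codebase).
CANONICAL = {"userId": "user_id", "uid": "user_id", "requestId": "request_id"}

def emit(level: str, event: str, **fields) -> str:
    """Render one structured log line, normalizing legacy field names."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "event": event,
    }
    for key, value in fields.items():
        record[CANONICAL.get(key, key)] = value
    return json.dumps(record)

# Both legacy spellings now land in the same queryable column:
print(emit("info", "user_login", userId="12345"))
print(emit("info", "user_login", uid="12345"))
```

In practice you would ship this as an internal package so the alias map has a single owner, rather than copy-pasting it per service.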

Property 3: Severity Levels Used Correctly

Severity levels are only useful if they mean something. When every log line is ERROR, nothing is. Here is the working definition for production:

Level   Use it when…                                              Example
DEBUG   Detailed internal state useful during development only    Loop iteration values, internal cache hits
INFO    Normal operational events worth retaining                 Request completed, user logged in, payment processed
WARN    Something unexpected happened but the system recovered    Retry succeeded after 1 failure, fallback triggered
ERROR   Something failed and requires investigation               Payment failed, DB connection refused, external API 5xx
FATAL   The process cannot continue                               Missing required config, unrecoverable panic

The rule: If you wouldn't page someone for it, it is not an ERROR.

Property 4: Contextual Fields (The "Why" Fields)

A log line without context is a timestamp and a shrug. The fields that transform a log from noise into signal are the ones that tell you where in the system, for whom, and as part of what flow the event occurred.

Before:

{ "level": "error", "message": "database query failed" }

After:

{
  "level": "error",
  "event": "db_query_failed",
  "service": "order-service",
  "table": "orders",
  "operation": "SELECT",
  "duration_ms": 5021,
  "user_id": "12345",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_01HXYZ9ABC",
  "error": "connection timeout after 5000ms"
}

The second version tells you the service, operation, how long it waited, which user was affected, and connects directly to the distributed trace. You can debug the second. You can only stare at the first.
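One practical way to get these context fields onto every line without threading them through every function call is request-scoped context. Below is a minimal stdlib sketch using contextvars; the bind and log helpers are illustrative, not a library API (structlog's contextvars support does the same thing in production):

```python
import contextvars
import json

# Request-scoped context: set once at the edge of a request,
# attached automatically to every log line emitted underneath it.
_log_ctx = contextvars.ContextVar("log_ctx", default={})

def bind(**fields):
    """Merge fields into the current request's logging context."""
    _log_ctx.set({**_log_ctx.get(), **fields})

def log(level: str, event: str, **fields) -> str:
    """Render one log line with the bound context merged in."""
    return json.dumps({"level": level, "event": event, **_log_ctx.get(), **fields})

# At the start of a request handler:
bind(trace_id="4bf92f3577b34da6a3ce929d0e0e4736", request_id="req_01HXYZ9ABC")

# Deep inside the call stack, the context rides along automatically:
print(log("error", "db_query_failed", table="orders", duration_ms=5021))
```

Because contextvars are isolated per task, this pattern stays correct under asyncio concurrency, where a plain module-level dict would leak context between requests.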

The Golden Field: trace_id

If you add exactly one field to every log line today, make it trace_id.

Here is why: in a distributed system, a single user-facing action, say clicking "Buy Now", triggers requests across a chain of services: API gateway → auth → inventory → payment → notification. Each service emits its own logs. Without a shared identifier, these logs are isolated islands. You see an error in the payment service but cannot reconstruct what the auth service saw, what the inventory service returned, or how long each hop took.

trace_id is the thread that stitches them together. It is the same identifier used by OpenTelemetry distributed traces. Once it is in your logs, you can pivot from "here is an error log" to "here is the full trace showing every service, every span, every duration, the exact call stack" in a single click.

Implementation by language:

Python: structlog + OTel

import structlog
from opentelemetry import trace

def add_trace_context(logger, method, event_dict):
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# Usage: trace_id is injected automatically
log.error("payment_failed", user_id="12345", error_code="CARD_DECLINED")

Node.js: pino + OTel

import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return {
      trace_id: ctx.traceId,
      span_id: ctx.spanId,
    };
  },
});

// Usage: trace_id is mixed in automatically
logger.error({ user_id: "12345", error_code: "CARD_DECLINED" }, "payment_failed");

Go: zerolog + OTel

import (
    "context"

    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
    "go.opentelemetry.io/otel/trace"
)

// logWithTrace starts an error-level event with trace context attached.
func logWithTrace(ctx context.Context) *zerolog.Event {
    sc := trace.SpanFromContext(ctx).SpanContext()
    return log.Error().
        Str("trace_id", sc.TraceID().String()).
        Str("span_id", sc.SpanID().String())
}

// Usage
logWithTrace(ctx).
    Str("user_id", "12345").
    Str("error_code", "CARD_DECLINED").
    Msg("payment_failed")

Once trace_id is in your logs, the workflow becomes:

  1. Alert fires → find the error log
  2. Copy trace_id from the log
  3. Pivot to distributed trace in your observability platform
  4. See every service, every latency, the full call graph

That pivot from log to trace is the difference between "we think we know what happened" and "we know exactly what happened."

What NOT to Log: The 5 Anti-Patterns

Getting structured logging right is as much about what you stop doing as what you start. Here are the five most common mistakes and how to fix them.

Logging Everything at DEBUG in Production

The mistake:

logger.debug(f"Processing item {item_id}, current state: {json.dumps(full_state)}")
# × 50,000 items per minute

The problem: DEBUG logs in production create two crises simultaneously: you pay for ingestion and storage at scale, and you bury the signal in noise. A system emitting 50,000 DEBUG lines per minute generates 72 million log lines per day. At typical SaaS log pricing, that is a surprisingly large bill for information you will never query.

The fix: Set production log level to INFO by default. Use sampling for DEBUG logs if you need them: emit 1% of debug-level lines, not 100%.

# In production config
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")  # Default to INFO, not DEBUG
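If you do need occasional DEBUG detail from production, sampling can be implemented as a standard library logging.Filter. A minimal sketch, assuming the 1% rate mentioned above; the DebugSampler class is illustrative, not from any library:

```python
import logging
import random

class DebugSampler(logging.Filter):
    """Pass INFO and above untouched; keep only ~`rate` of DEBUG records."""

    def __init__(self, rate: float = 0.01):
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        # Everything above DEBUG is always emitted.
        if record.levelno > logging.DEBUG:
            return True
        # DEBUG records survive with probability `rate`.
        return random.random() < self.rate

logger = logging.getLogger("payment-service")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.addFilter(DebugSampler(rate=0.01))
logger.addHandler(handler)
```

Attaching the filter to the handler (rather than the logger) means sampling happens at the output boundary, so other handlers can still receive the full stream if you want a short-lived local tap.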

Logging PII in Plain Text

The mistake:

{
  "event": "user_login",
  "email": "jane.doe@example.com",
  "ip_address": "93.184.216.34",
  "credit_card": "4111111111111111"
}

The problem: GDPR, CCPA, HIPAA, and PCI-DSS all treat logs as data stores. If you log an email address or a card number, that data is now subject to retention limits, right-to-erasure requests, and breach notification requirements, in your log platform, in your log archives, and in every system the logs flow through.

The fix: Log identifiers, not values. Hash or mask sensitive fields at the logger level, not at the destination.

{
  "event": "user_login",
  "user_id": "12345",
  "ip_prefix": "93.184.x.x",
  "card_last4": "1111"
}

Inconsistent Field Names Across Services

The mistake:

  • Payment service: "userId": "12345"
  • Auth service: "user_id": "12345"
  • Notification service: "uid": "12345"

The problem: You cannot query across services for a single user. Every analytics query needs an OR clause covering three field names. Every new engineer has to learn which service uses which convention. Dashboard queries silently miss data from two-thirds of your services.

The fix: Define a canonical field schema as a shared internal package or document, enforce it in code review, and lint for it in CI.
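The CI lint can be as simple as a script that fails the build when a banned alias appears in source. A minimal sketch; the BANNED map is an example, so extend it with your own schema's aliases:

```python
import re

# Banned aliases mapped to their canonical replacement (example entries).
BANNED = {"userId": "user_id", "uid": "user_id", "requestId": "request_id"}

def lint(source: str) -> list:
    """Return one violation message per banned field name, for a CI step."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for alias, canonical in BANNED.items():
            if re.search(rf"\b{alias}\b", line):
                violations.append(f"line {lineno}: '{alias}' -> use '{canonical}'")
    return violations

print(lint('logger.info({"userId": user.id})'))
```

Wire it into CI so the job exits nonzero when lint returns anything, and new aliases never make it past review.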

Logging Errors Without Context

The mistake:

{
  "level": "error",
  "message": "NullPointerException at PaymentProcessor.java:147"
}

The problem: A stack trace without context is an archaeological find: you know something happened, but not who was affected, what they were doing, or which request caused it. You cannot reproduce it. You cannot correlate it.

The fix: Every error log must include trace_id, request_id, and enough domain context to reproduce the failure.

{
  "level": "error",
  "event": "payment_processing_failed",
  "user_id": "12345",
  "order_id": "9981",
  "payment_method": "credit_card",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_01HXYZ9ABC",
  "error": "NullPointerException",
  "stack_trace": "at PaymentProcessor.java:147..."
}

Using Log Messages as Metrics

The mistake:

# Counting errors by parsing log messages
logger.info(f"Payment processed successfully. Total today: {counter}")
logger.error(f"Payment failed. Failure count: {failure_counter}")

The problem: Logs are optimized for context and correlation, not for counting and aggregation. Deriving metrics from log parsing is expensive (you pay to ingest every log line), fragile (message format changes break dashboards), and delayed (log pipelines introduce latency).

The fix: Use OpenTelemetry metrics for anything you want to count, rate, or histogram. Logs are for events. Metrics are for measurements.

from opentelemetry import metrics

meter = metrics.get_meter("payment-service")
payment_counter = meter.create_counter("payments_processed_total")
failure_counter = meter.create_counter("payment_failures_total")

# In your payment handler:
payment_counter.add(1, {"status": "success", "method": "credit_card"})
# or
failure_counter.add(1, {"error_code": "CARD_DECLINED", "method": "credit_card"})

Your dashboards now query a metrics store (fast, cheap, designed for aggregation) instead of parsing log text (slow, expensive, fragile).

Setting Up Structured Logging with OpenTelemetry

OpenTelemetry's log signal is the cleanest way to get structured logs, trace_id injection, and OTLP export working together without building any of it yourself.

Here is the full setup for Python as a reference implementation. The pattern is identical in Node.js, Go, and Java.

Step 1: Install Dependencies

pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-logging \
  structlog

Step 2: Configure the OTel SDK with Log Export

# otel_setup.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import logging

def setup_otel(service_name: str, otlp_endpoint: str):
    # Resource tags every span and log record with the service name
    resource = Resource.create({"service.name": service_name})

    # Trace provider
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(tracer_provider)

    # Log provider: automatically injects trace_id + span_id
    logger_provider = LoggerProvider(resource=resource)
    logger_provider.add_log_record_processor(
        BatchLogRecordProcessor(OTLPLogExporter(endpoint=otlp_endpoint))
    )

    # Attach to Python's standard logging
    handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
    logging.getLogger().addHandler(handler)

    return tracer_provider, logger_provider

Step 3: Wire Up structlog for JSON Output

# logging_setup.py
import structlog
import logging

def configure_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        logger_factory=structlog.stdlib.LoggerFactory(),
    )

Step 4: Use It

# main.py
from otel_setup import setup_otel
from logging_setup import configure_logging
import structlog

setup_otel(
    service_name="payment-service",
    otlp_endpoint="https://your-openobserve-instance.com:4317"
)
configure_logging()

log = structlog.get_logger()

def process_payment(user_id: str, amount: float):
    # structlog + OTel SDK automatically injects trace_id, span_id
    log.info(
        "payment_processing_started",
        user_id=user_id,
        amount_usd=amount,
        service="payment-service"
    )

What You Get

Every log line shipped to OpenObserve (or any OTLP-compatible backend) now looks like this, with trace_id and span_id injected automatically by the OTel SDK and zero manual work per log call:

{
  "timestamp": "2026-03-24T14:23:07.412Z",
  "level": "info",
  "service": "payment-service",
  "event": "payment_processing_started",
  "user_id": "12345",
  "amount_usd": 149.99,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}

From this point, every log is automatically linked to its distributed trace. Clicking trace_id in the log viewer opens the full trace. Clicking a span in the trace shows the correlated logs. The feedback loop that used to take hours of log archaeology now takes seconds.

Querying Your Logs with SQL

This is where structured logging pays its dividend.

Once your logs are in OpenObserve, every field is a column in a queryable table. The same log data that used to require grep | awk | sed pipelines can now be analyzed with standard SQL in real time, across billions of log lines.

Query 1: Count Errors by Service in the Last Hour

SELECT
  service,
  COUNT(*) AS error_count
FROM logs
WHERE
  level = 'error'
  AND timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY service
ORDER BY error_count DESC;

This query is impossible with unstructured logs. With structured logs and OpenObserve, it runs in seconds.

Query 2: Full Request History for a Single User

SELECT
  timestamp,
  service,
  event,
  level,
  error,
  trace_id
FROM logs
WHERE
  user_id = '12345'
ORDER BY timestamp ASC;

You now have the complete chronological audit trail for user 12345 across every service in one result set.

Query 3: Error Rate Over Time (for Dashboards and Alerts)

SELECT
  DATE_TRUNC('minute', timestamp) AS minute,
  service,
  COUNT(*) FILTER (WHERE level = 'error') AS errors,
  COUNT(*) AS total,
  ROUND(
    COUNT(*) FILTER (WHERE level = 'error') * 100.0 / COUNT(*),
    2
  ) AS error_rate_pct
FROM logs
GROUP BY minute, service
ORDER BY minute ASC;

Plot this query on a dashboard and you have a real-time error rate chart per service: the kind of chart that makes the difference between catching a degradation in 2 minutes versus discovering it from a customer tweet.

Query 4: Find the Slowest Requests

SELECT
  trace_id,
  request_id,
  user_id,
  duration_ms,
  event,
  timestamp
FROM logs
WHERE
  service = 'payment-service'
  AND duration_ms IS NOT NULL
  AND timestamp >= NOW() - INTERVAL '6 hours'
ORDER BY duration_ms DESC
LIMIT 20;

Copy any trace_id from the results and open it directly in the trace viewer to understand why those requests were slow.

These queries represent a category of observability work that simply does not exist with unstructured logs. The logs become a first-class data source, not a last resort.

Conclusion

Structured logging is not a nice-to-have for distributed systems; it is the prerequisite for every other observability practice. Without it, you cannot correlate across services, you cannot query at scale, and you cannot pivot from an alert to a root cause without spending 45 minutes in a terminal.

The implementation path is straightforward:

  1. Emit JSON with consistent, agreed-upon field names
  2. Add trace_id to every log via the OTel SDK
  3. Eliminate the five anti-patterns (DEBUG in prod, PII, inconsistent fields, contextless errors, logs-as-metrics)
  4. Ship via OTLP to a SQL-queryable backend
  5. Write queries, not grep commands

The teams that get this right do not just debug faster. They stop flying blind.

Want to try SQL log analytics on your own structured logs? OpenObserve offers a free tier with full SQL query support, built-in distributed trace correlation, and OTLP ingestion, with no credit card required.

Frequently Asked Questions

What is structured logging?

Structured logging is the practice of emitting log records as machine-readable key-value pairs (typically JSON) rather than free-text strings. Each field in a structured log is independently queryable, enabling aggregation, filtering, and correlation across services at scale.

What is the difference between structured and unstructured logging?

Unstructured logs are free-text strings like "ERROR: payment failed for user 12345". They are human-readable but cannot be queried programmatically without regex parsing. Structured logs emit the same information as discrete fields (level, event, user_id) that any SQL or log analytics engine can filter, group, and aggregate natively.

What fields should every log include?

Every production log should include at minimum: timestamp (ISO 8601 UTC), level (severity), service (originating service), event (snake-case event name), trace_id (OTel trace ID), and request_id. Domain-specific fields like user_id and order_id should be added wherever relevant.

What is a trace_id in logging?

A trace_id is a 128-bit identifier, standardized by OpenTelemetry, that is shared across every service involved in processing a single user-facing request. Adding trace_id to log lines allows you to find all logs produced by a single request across dozens of services with a single query, and to pivot directly from a log entry to the distributed trace that caused it.

What are the most common structured logging mistakes?

The five most common mistakes are: (1) enabling DEBUG logging in production, which creates cost and noise; (2) logging PII in plain text, which creates compliance exposure; (3) using inconsistent field names across services, which breaks cross-service queries; (4) logging errors without context like trace_id or request_id; and (5) using log lines as a substitute for metrics, which is expensive and fragile.

How does OpenTelemetry improve structured logging?

OpenTelemetry's log SDK automatically injects trace_id and span_id into every log record produced while a trace is active. This means logs are correlated to distributed traces with zero per-log-call effort, and the logs can be exported via OTLP to any compatible backend alongside traces and metrics giving you a unified observability signal from a single SDK.

Can I query structured logs with SQL?

Yes. Log platforms like OpenObserve expose structured log fields as SQL-queryable columns. This enables analytics that are impossible with unstructured logs: counting errors by service, computing error rates over time, finding the slowest requests, or retrieving the full log history for a specific user, all with standard SQL.

How do I implement structured logging in Python?

Use structlog for structured log formatting and the opentelemetry-instrumentation-logging package to inject trace_id automatically. Configure structlog with JSONRenderer as the final processor, attach the OTel LoggingHandler to Python's root logger, and export via OTLP to your log backend.

Is structured logging expensive?

Structured logging itself has negligible overhead. The cost driver in log management is volume, not format. Structured logging typically reduces costs compared to unstructured logging because it enables you to set meaningful severity filters (eliminating DEBUG logs in production), use sampling intelligently, and route logs to the right storage tier because every log has queryable metadata.

About the Author

Simran Kumari


Passionate about observability, AI systems, and cloud-native tools. All in on DevOps and improving the developer experience.
