Structured Logging in Production: The Field Guide Nobody Gave You



Every engineering team starts the same way. Someone adds a print statement. Then a logger. Then a hundred loggers. Before long, your production system is emitting thousands of lines like this per minute:
```
ERROR: user 12345 failed payment at 14:23
WARN: retry attempt 2 for order 9981
INFO: request completed in 340ms
```
These logs are perfectly readable to a human, staring at a terminal, debugging a single service, on a quiet afternoon.
That is exactly the wrong context for when you need them.
In production, logs need to answer questions like: which service is erroring right now? Which users were affected? How has the error rate changed over the last hour? Which requests were slow, and why?
An unstructured log line cannot answer any of these. You cannot GROUP BY a sentence. You cannot join a free-text string to a distributed trace. You cannot build an alert on a substring match at scale without burning money on regex filters that break the moment a developer changes their log message wording.
Now look at what the same event looks like as a structured log:
```json
{
  "timestamp": "2026-03-24T14:23:07.412Z",
  "level": "error",
  "service": "payment-service",
  "event": "payment_failed",
  "user_id": "12345",
  "order_id": "9981",
  "error_code": "CARD_DECLINED",
  "amount_usd": 149.99,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req_01HXYZ9ABC"
}
```
Every field is a column. Every column is queryable. Every log line is machine-readable and human-readable. This is where production-grade observability begins.
A good production log has four non-negotiable properties. Most teams implement one or two. The teams that implement all four are the ones who can actually debug incidents in minutes instead of hours.
The format is the foundation. JSON is the default choice: it is natively supported by every major log aggregation platform (OpenObserve, Datadog, Grafana Loki, Elasticsearch) and parseable in every language without custom tooling.
Before:
```
INFO payment processed for user 12345 amount 149.99
```
After:
```json
{
  "level": "info",
  "event": "payment_processed",
  "user_id": "12345",
  "amount_usd": 149.99
}
```
This is the property teams most frequently skip, and the one that makes or breaks cross-service debugging. When your payment service logs userId and your auth service logs user_id and your notification service logs uid, you cannot write a single query to find all logs for a given user.
Define a shared log schema and enforce it. At minimum, standardize these fields across every service:
| Field | Description | Example |
|---|---|---|
| `service` | Originating service name | `"payment-service"` |
| `level` | Severity | `"error"` |
| `event` | Snake-case event name | `"payment_failed"` |
| `timestamp` | ISO 8601 UTC | `"2026-03-24T14:23:07Z"` |
| `trace_id` | OTel trace ID | `"4bf92f3577b34da6a3ce929d0e0e4736"` |
| `user_id` | User identifier (non-PII) | `"12345"` |
| `request_id` | Inbound request ID | `"req_01HXYZ9ABC"` |
Before (payment service): `userId: "12345"`
Before (auth service): `uid: "12345"`
After (both services): `user_id: "12345"`, and your query works everywhere.
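One way to make the shared schema enforceable rather than aspirational is a small internal module that every service imports. The sketch below is illustrative: the module layout, field set, and helper name are assumptions, not part of any standard library.

```python
# Hypothetical shared module, e.g. company_logging/schema.py.
REQUIRED_FIELDS = {"service", "level", "event", "timestamp"}

# Canonical names for identifiers that tend to drift between teams.
CANONICAL = {"userId": "user_id", "uid": "user_id", "requestId": "request_id"}

def validate_log_record(record: dict) -> dict:
    """Rename drifted field names and fail fast on missing required fields."""
    normalized = {CANONICAL.get(key, key): value for key, value in record.items()}
    missing = REQUIRED_FIELDS - normalized.keys()
    if missing:
        raise ValueError(f"log record missing required fields: {sorted(missing)}")
    return normalized

record = validate_log_record({
    "service": "payment-service",
    "level": "error",
    "event": "payment_failed",
    "timestamp": "2026-03-24T14:23:07Z",
    "userId": "12345",  # drifted name, normalized to user_id
})
```

Wiring a helper like this into the logger itself (for example as a structlog processor) catches drift at runtime instead of relying on code review alone.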
Severity levels are only useful if they mean something. When every log line is ERROR, nothing is. Here is the working definition for production:
| Level | Use it when… | Example |
|---|---|---|
| DEBUG | Detailed internal state, useful during development only | Loop iteration values, internal cache hits |
| INFO | Normal operational events worth retaining | Request completed, user logged in, payment processed |
| WARN | Something unexpected happened but the system recovered | Retry succeeded after 1 failure, fallback triggered |
| ERROR | Something failed and requires investigation | Payment failed, DB connection refused, external API 5xx |
| FATAL | The process cannot continue | Missing required config, unrecoverable panic |
The rule: If you wouldn't page someone for it, it is not an ERROR.
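The table translates directly into code. A retry loop is the classic case: a recovered failure is a WARN, and only the final, unrecovered failure is an ERROR. This is an illustrative sketch; `charge` is a hypothetical callable standing in for the real payment call.

```python
import logging

log = logging.getLogger("payment-service")

def charge_with_retries(charge, max_attempts=3):
    """Retry loop illustrating level discipline: recovered failures log
    at WARN; only the final, unrecovered failure logs at ERROR."""
    for attempt in range(1, max_attempts + 1):
        try:
            return charge()
        except ConnectionError as exc:
            if attempt < max_attempts:
                # The system recovered by retrying: WARN, not ERROR.
                log.warning("charge_retry attempt=%d error=%s", attempt, exc)
            else:
                # Out of retries: this requires investigation, so ERROR.
                log.error("charge_failed attempts=%d error=%s", attempt, exc)
                raise
```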
A log line without context is a timestamp and a shrug. The fields that transform a log from noise into signal are the ones that tell you where in the system, for whom, and as part of what flow the event occurred.
Before:
```json
{ "level": "error", "message": "database query failed" }
```
After:
```json
{
  "level": "error",
  "event": "db_query_failed",
  "service": "order-service",
  "table": "orders",
  "operation": "SELECT",
  "duration_ms": 5021,
  "user_id": "12345",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_01HXYZ9ABC",
  "error": "connection timeout after 5000ms"
}
```
The second version tells you the service, operation, how long it waited, which user was affected, and connects directly to the distributed trace. You can debug the second. You can only stare at the first.
If you add exactly one field to every log line today, make it trace_id.
Here is why: in a distributed system, a single user-facing action, say clicking "Buy Now", triggers requests across a chain of services: API gateway → auth → inventory → payment → notification. Each service emits its own logs. Without a shared identifier, those logs are isolated islands. You see an error in the payment service but cannot reconstruct what the auth service saw, what the inventory service returned, or how long each hop took.
trace_id is the thread that stitches them together. It is the same identifier used by OpenTelemetry distributed traces. Once it is in your logs, you can pivot from "here is an error log" to "here is the full trace showing every service, every span, every duration, the exact call stack" in a single click.
Implementation by language:
Python: structlog + OTel
```python
import structlog
from opentelemetry import trace

def add_trace_context(logger, method, event_dict):
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# Usage: trace_id is injected automatically
log.error("payment_failed", user_id="12345", error_code="CARD_DECLINED")
```
Node.js: pino + OTel
```javascript
import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return {
      trace_id: ctx.traceId,
      span_id: ctx.spanId,
    };
  },
});

// Usage: trace_id is mixed in automatically
logger.error({ user_id: "12345", error_code: "CARD_DECLINED" }, "payment_failed");
```
Go: zerolog + OTel
```go
import (
	"context"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
	"go.opentelemetry.io/otel/trace"
)

// logWithTrace starts an error-level event pre-populated with the
// trace context carried in ctx.
func logWithTrace(ctx context.Context) *zerolog.Event {
	sc := trace.SpanFromContext(ctx).SpanContext()
	return log.Error().
		Str("trace_id", sc.TraceID().String()).
		Str("span_id", sc.SpanID().String())
}

// Usage
logWithTrace(ctx).
	Str("user_id", "12345").
	Str("error_code", "CARD_DECLINED").
	Msg("payment_failed")
```
Once trace_id is in your logs, the workflow becomes: find the error log, copy its trace_id, and open the matching trace in your trace viewer. That pivot from log to trace is the difference between "we think we know what happened" and "we know exactly what happened."
Getting structured logging right is as much about what you stop doing as what you start. Here are the five most common mistakes and how to fix them.
The mistake:
```python
logger.debug(f"Processing item {item_id}, current state: {json.dumps(full_state)}")
# × 50,000 items per minute
```
The problem: DEBUG logs in production create two crises simultaneously: you pay for ingestion and storage at scale, and you bury the signal in noise. A system emitting 50,000 DEBUG lines per minute generates 72 million log lines per day. At typical SaaS log pricing, that is a surprisingly large bill for information you will never query.
The fix: Set production log level to INFO by default. Use sampling for DEBUG logs if you need them emit 1% of debug-level lines, not 100%.
```python
# In production config
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")  # Default to INFO, not DEBUG
```
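If you genuinely need some DEBUG visibility in production, sample it instead of turning it off entirely. A sketch using a standard `logging.Filter`; the 1% rate is the configurable knob.

```python
import logging
import random

class DebugSampler(logging.Filter):
    """Pass all records at INFO and above; pass only a fraction of DEBUG records."""

    def __init__(self, sample_rate: float = 0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # INFO/WARN/ERROR always pass
        return random.random() < self.sample_rate  # keep ~1% of DEBUG

logger = logging.getLogger("payment-service")
logger.addFilter(DebugSampler(sample_rate=0.01))
```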
The mistake:
```json
{
  "event": "user_login",
  "email": "jane.doe@example.com",
  "ip_address": "93.184.216.34",
  "credit_card": "4111111111111111"
}
```
The problem: GDPR, CCPA, HIPAA, and PCI-DSS all treat logs as data stores. If you log an email address or a card number, that data is now subject to retention limits, right-to-erasure requests, and breach notification requirements in your log platform, your log archives, and every system the logs flow through.
The fix: Log identifiers, not values. Hash or mask sensitive fields at the logger level, not at the destination.
```json
{
  "event": "user_login",
  "user_id": "12345",
  "ip_prefix": "93.184.x.x",
  "card_last4": "1111"
}
```
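Masking at the logger level can be a single processor that runs before the renderer, so sensitive values never leave the process. This is a hedged sketch; the field list and hashing scheme are illustrative choices, not a compliance recommendation.

```python
import hashlib

# Illustrative list; expand for your own data model.
SENSITIVE = {"email", "credit_card", "ssn"}

def mask_pii(logger, method, event_dict):
    """structlog-style processor: replace sensitive values before rendering."""
    for key in list(event_dict):
        if key in SENSITIVE:
            value = str(event_dict.pop(key))
            # Keep a stable, non-reversible handle so events stay correlatable.
            event_dict[f"{key}_hash"] = hashlib.sha256(value.encode()).hexdigest()[:12]
        elif key == "ip_address":
            octets = str(event_dict.pop(key)).split(".")
            event_dict["ip_prefix"] = ".".join(octets[:2] + ["x", "x"])
    return event_dict
```

Register it early in structlog's processors list (before the JSONRenderer) so every log path passes through it.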
The mistake:
```json
"userId": "12345"
"user_id": "12345"
"uid": "12345"
```
The problem: You cannot query across services for a single user. Every analytics query needs an OR clause covering three field names. Every new engineer has to learn which service uses which convention. Dashboard queries silently miss data from two-thirds of your services.
The fix: Define a canonical field schema as a shared internal package or document, enforce it in code review, and lint for it in CI.
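Linting for banned field names in CI can be as small as a script that fails the build when a non-canonical name appears in source. A rough sketch; the banned list and file glob are assumptions you would adapt to your codebase.

```python
# Hypothetical CI check: fail the build if source uses non-canonical field names.
import pathlib
import re

BANNED = {"userId": "user_id", "uid": "user_id", "requestId": "request_id"}
PATTERN = re.compile(r"\b(" + "|".join(BANNED) + r")\b")

def lint(root: str) -> list:
    """Return one violation message per banned field name found under root."""
    violations = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            for match in PATTERN.finditer(line):
                bad = match.group(1)
                violations.append(f"{path}:{lineno}: use {BANNED[bad]!r}, not {bad!r}")
    return violations
```

Exit non-zero whenever `lint(".")` returns anything and CI blocks the merge.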
The mistake:
```json
{
  "level": "error",
  "message": "NullPointerException at PaymentProcessor.java:147"
}
```
The problem: A stack trace without context is an archaeological find: you know something happened, but not who was affected, what they were doing, or which request caused it. You cannot reproduce it. You cannot correlate it.
The fix: Every error log must include trace_id, request_id, and enough domain context to reproduce the failure.
```json
{
  "level": "error",
  "event": "payment_processing_failed",
  "user_id": "12345",
  "order_id": "9981",
  "payment_method": "credit_card",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "request_id": "req_01HXYZ9ABC",
  "error": "NullPointerException",
  "stack_trace": "at PaymentProcessor.java:147..."
}
```
The mistake:
```python
# Counting errors by parsing log messages
logger.info(f"Payment processed successfully. Total today: {counter}")
logger.error(f"Payment failed. Failure count: {failure_counter}")
```
The problem: Logs are optimized for context and correlation, not for counting and aggregation. Deriving metrics from log parsing is expensive (you pay to ingest every log line), fragile (message format changes break dashboards), and delayed (log pipelines introduce latency).
The fix: Use OpenTelemetry metrics for anything you want to count, rate, or histogram. Logs are for events. Metrics are for measurements.
```python
from opentelemetry import metrics

meter = metrics.get_meter("payment-service")
payment_counter = meter.create_counter("payments_processed_total")
failure_counter = meter.create_counter("payment_failures_total")

# In your payment handler:
payment_counter.add(1, {"status": "success", "method": "credit_card"})
# or
failure_counter.add(1, {"error_code": "CARD_DECLINED", "method": "credit_card"})
```
Your dashboards now query a metrics store (fast, cheap, designed for aggregation) instead of parsing log text (slow, expensive, fragile).
OpenTelemetry's log signal is the cleanest way to get structured logs, trace_id injection, and OTLP export working together without building any of it yourself.
Here is the full setup for Python as a reference implementation. The pattern is identical in Node.js, Go, and Java.
```shell
pip install \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-logging \
  structlog
```
```python
# otel_setup.py
import logging

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def setup_otel(service_name: str, otlp_endpoint: str):
    resource = Resource.create({"service.name": service_name})

    # Trace provider
    tracer_provider = TracerProvider(resource=resource)
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
    )
    trace.set_tracer_provider(tracer_provider)

    # Log provider: automatically injects trace_id + span_id
    logger_provider = LoggerProvider(resource=resource)
    logger_provider.add_log_record_processor(
        BatchLogRecordProcessor(OTLPLogExporter(endpoint=otlp_endpoint))
    )

    # Attach to Python's standard logging
    handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
    logging.getLogger().addHandler(handler)

    return tracer_provider, logger_provider
```
```python
# logging_setup.py
import logging

import structlog

def configure_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
        logger_factory=structlog.stdlib.LoggerFactory(),
    )
```
```python
# main.py
import structlog

from otel_setup import setup_otel
from logging_setup import configure_logging

setup_otel(
    service_name="payment-service",
    otlp_endpoint="https://your-openobserve-instance.com:4317"
)
configure_logging()

log = structlog.get_logger()

def process_payment(user_id: str, amount: float):
    # structlog + the OTel SDK inject trace_id and span_id automatically
    log.info(
        "payment_processing_started",
        user_id=user_id,
        amount_usd=amount,
        service="payment-service",
    )
```
Every log line shipped to OpenObserve (or any OTLP-compatible backend) now looks like this, with trace_id and span_id injected automatically by the OTel SDK and zero manual work per log call:
```json
{
  "timestamp": "2026-03-24T14:23:07.412Z",
  "level": "info",
  "service": "payment-service",
  "event": "payment_processing_started",
  "user_id": "12345",
  "amount_usd": 149.99,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}
```
From this point, every log is automatically linked to its distributed trace. Clicking trace_id in the log viewer opens the full trace. Clicking a span in the trace shows the correlated logs. The feedback loop that used to take hours of log archaeology now takes seconds.
This is where structured logging pays its dividend.
Once your logs are in OpenObserve, every field is a column in a queryable table. The same log data that used to require `grep | awk | sed` pipelines can now be analyzed with standard SQL in real time, across billions of log lines.
```sql
SELECT
  service,
  COUNT(*) AS error_count
FROM logs
WHERE
  level = 'error'
GROUP BY service
ORDER BY error_count DESC;
```
This query is impossible with unstructured logs. With structured logs and OpenObserve, it runs in seconds.
```sql
SELECT
  timestamp,
  service,
  event,
  level,
  error,
  trace_id
FROM logs
WHERE
  user_id = '12345'
ORDER BY timestamp ASC;
```
You now have the complete chronological audit trail for user 12345 across every service in one result set.
```sql
SELECT
  DATE_TRUNC('minute', timestamp) AS minute,
  service,
  COUNT(*) FILTER (WHERE level = 'error') AS errors,
  COUNT(*) AS total,
  ROUND(
    COUNT(*) FILTER (WHERE level = 'error') * 100.0 / COUNT(*),
    2
  ) AS error_rate_pct
FROM logs
GROUP BY minute, service
ORDER BY minute ASC;
```
Plot this query on a dashboard and you have a real-time error rate chart per service: the kind of chart that makes the difference between catching a degradation in 2 minutes and discovering it from a customer tweet.
```sql
SELECT
  trace_id,
  request_id,
  user_id,
  duration_ms,
  event,
  timestamp
FROM logs
WHERE
  service = 'payment-service'
  AND duration_ms IS NOT NULL
  AND timestamp >= NOW() - INTERVAL '6 hours'
ORDER BY duration_ms DESC
LIMIT 20;
```
Copy any trace_id from the results and open it directly in the trace viewer to understand why those requests were slow.
These queries represent a category of observability work that simply does not exist with unstructured logs. The logs become a first-class data source, not a last resort.
Structured logging is not a nice-to-have for distributed systems; it is the prerequisite for every other observability practice. Without it, you cannot correlate across services, you cannot query at scale, and you cannot pivot from an alert to a root cause without spending 45 minutes in a terminal.
The implementation path is straightforward: emit JSON, standardize a shared field schema, apply log levels with discipline, enrich every line with context, and add trace_id to every log via the OTel SDK. The teams that get this right do not just debug faster. They stop flying blind.
Want to try SQL log analytics on your own structured logs? OpenObserve offers a free tier with full SQL query support, built-in distributed trace correlation, and OTLP ingestion, no credit card required.
Structured logging is the practice of emitting log records as machine-readable key-value pairs (typically JSON) rather than free-text strings. Each field in a structured log is independently queryable, enabling aggregation, filtering, and correlation across services at scale.
Unstructured logs are free-text strings like "ERROR: payment failed for user 12345". They are human-readable but cannot be queried programmatically without regex parsing. Structured logs emit the same information as discrete fields (level, event, user_id) that any SQL or log analytics engine can filter, group, and aggregate natively.
Every production log should include at minimum: timestamp (ISO 8601 UTC), level (severity), service (originating service), event (snake-case event name), trace_id (OTel trace ID), and request_id. Domain-specific fields like user_id and order_id should be added wherever relevant.
A trace_id is a 128-bit identifier, standardized by OpenTelemetry, that is shared across every service involved in processing a single user-facing request. Adding trace_id to log lines allows you to find all logs produced by a single request across dozens of services with a single query, and to pivot directly from a log entry to the distributed trace that caused it.
The five most common mistakes are: (1) enabling DEBUG logging in production, which creates cost and noise; (2) logging PII in plain text, which creates compliance exposure; (3) using inconsistent field names across services, which breaks cross-service queries; (4) logging errors without context like trace_id or request_id; and (5) using log lines as a substitute for metrics, which is expensive and fragile.
OpenTelemetry's log SDK automatically injects trace_id and span_id into every log record produced while a trace is active. This means logs are correlated to distributed traces with zero per-log-call effort, and the logs can be exported via OTLP to any compatible backend alongside traces and metrics, giving you a unified observability signal from a single SDK.
Yes. Log platforms like OpenObserve expose structured log fields as SQL-queryable columns. This enables analytics that are impossible with unstructured logs: counting errors by service, computing error rates over time, finding the slowest requests, or retrieving the full log history for a specific user, all with standard SQL.
Use structlog for structured log formatting and the opentelemetry-instrumentation-logging package to inject trace_id automatically. Configure structlog with JSONRenderer as the final processor, attach the OTel LoggingHandler to Python's root logger, and export via OTLP to your log backend.
Structured logging itself has negligible overhead. The cost driver in log management is volume, not format. Structured logging typically reduces costs compared to unstructured logging because it enables you to set meaningful severity filters (eliminating DEBUG logs in production), use sampling intelligently, and route logs to the right storage tier, because every log has queryable metadata.