How We Built XDrain in Rust and Why It Made Log Pattern Detection Actually Fast


Have you ever stared at a search result with millions of log lines and felt like you were just drowning in a wall of text?
For an SRE during an incident, this isn't just a data problem—it's a time-to-resolution problem. The real bottleneck in observability isn't the storage; it's the "intelligence" you can apply to that data in real-time. Instead of manually scrolling through 1,000,000 lines, you need to know instantly that an error is spiking across 50 different servers.
In our previous post, we introduced the concept of automatic log pattern extraction. But once we proved the concept, we hit a new challenge at OpenObserve: making it fast enough to run alongside every single search query without making the user wait.
The goal was to turn those 1,000,000 lines into "5 distinct templates" instantly. We didn't want pre-processing or complex schemas. We wanted it live.

We looked at the academic gold standard, Drain, and its more stable sibling, XDrain. While the original Python implementations were brilliant for research, they struggled with the realities of production. We spent a lot of time researching whether to stick with Python and just optimize the C-extensions, but ultimately, the overhead of Python's Global Interpreter Lock (GIL) and string allocation turned a quick search into a frustrating wait.
So, we did what any performance-obsessed team does: we rewrote it in Rust.
TL;DR: Drain turns unstructured logs into structured templates by identifying common text and grouping variables (like IDs or IPs) into placeholders.
At its heart, the Drain algorithm is a prefix tree (trie). It breaks a log line into tokens and follows branches to see if a similar line has appeared before.
Imagine these two log lines entering the system:
```
[10:00] User 123 logged in from 1.1.1.1
[10:01] User 456 logged in from 2.2.2.2
```
The prefix tree processes them like this:
```
Root
└── User
    └── <ID>            (collapses 123 and 456)
        └── logged
            └── in
                └── from
                    └── <IP>  (collapses 1.1.1.1 and 2.2.2.2)
```
Resulting template: `User <*> logged in from <*>`
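The collapse step is easy to sketch in a few lines of Rust. This is an illustrative toy, not the OpenObserve implementation (and it ignores the timestamps, which masking would normally strip first): it merges two same-length token sequences position by position, replacing mismatches with the `<*>` wildcard.

```rust
/// Toy sketch of Drain's collapse step: merge two log lines into a
/// template by comparing tokens position by position. Real Drain does
/// this against a cluster's stored template inside the prefix tree.
fn merge_template(a: &str, b: &str) -> Option<String> {
    let ta: Vec<&str> = a.split_whitespace().collect();
    let tb: Vec<&str> = b.split_whitespace().collect();
    // Drain only groups lines with the same token count.
    if ta.len() != tb.len() {
        return None;
    }
    let merged: Vec<&str> = ta
        .iter()
        .zip(tb.iter())
        .map(|(x, y)| if x == y { *x } else { "<*>" })
        .collect();
    Some(merged.join(" "))
}

fn main() {
    let t = merge_template(
        "User 123 logged in from 1.1.1.1",
        "User 456 logged in from 2.2.2.2",
    );
    println!("{}", t.unwrap()); // User <*> logged in from <*>
}
```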
However, Drain has a frustrating weakness: it is incredibly sensitive to order. If logs arrive in a slightly different sequence, the tree might branch incorrectly, leaving you with two messy clusters instead of one clean one. XDrain fixes this with a "voting" mechanism: you shuffle the logs, run the parser multiple times, and let the results vote on the most stable template.
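The vote itself is simple to picture. Here's a hedged sketch (not the real XDrain code), assuming each shuffle pass has already produced one candidate template for the same cluster: tally the candidates and keep the majority.

```rust
use std::collections::HashMap;

/// Toy sketch of XDrain's voting step: given the template each shuffle
/// pass produced for one cluster, return the most frequent one.
fn majority_template(candidates: &[&str]) -> String {
    let mut votes: HashMap<&str, usize> = HashMap::new();
    for t in candidates {
        *votes.entry(t).or_insert(0) += 1;
    }
    votes
        .into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(t, _)| t.to_string())
        .unwrap_or_default()
}

fn main() {
    // Two of three shuffle passes agree, so the stable template wins.
    let winner = majority_template(&[
        "User <*> logged in from <*>",
        "User <*> logged in from <*>",
        "User <*> logged <*> from <*>",
    ]);
    println!("{winner}"); // User <*> logged in from <*>
}
```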
In Python, multi-pass shuffling for every search request is a performance nightmare. In Rust, we could make it fly.
We didn't pick Rust just for the "cool factor." We needed four specific things to make search-time analysis viable:
- **Zero-Cost Abstractions:** Traversing prefix trees and calculating similarity scores needs to happen without runtime baggage.
- **Predictable Memory:** In a "noisy" log stream (think: unique UUIDs in every line), memory can explode. We needed a way to cap memory strictly so one heavy search doesn't crash the entire node.
- **Hardware Acceleration:** We wanted to use vectorscan-rs (Rust bindings for Vectorscan, the portable fork of Intel's Hyperscan) for masking. Rust's FFI makes this low-level C integration seamless.
- **Ecosystem Synergy:** OpenObserve is built from the ground up in Rust. It's the language we know best, and by writing XDrain in the same stack we avoid "impedance mismatch": the library can be embedded directly into our query engine for maximum performance without any serialization overhead.
One of the biggest concerns with search-time analysis is "noisy neighbors." To prevent memory bloat, we used a dual-mode cluster storage system via the lru crate. If a log stream gets too chaotic, older patterns get evicted automatically.
```rust
enum ClusterStorage {
    // No cap: fine for well-behaved streams.
    Unbounded(HashMap<ClusterId, Cluster>),
    // Hard cap: least-recently-used patterns are evicted.
    Bounded(LruCache<ClusterId, Cluster>),
}
```
By modeling this as an enum, we make the memory policy explicit in the type system: every `match` must handle both variants, so we don't need to pepper the code with if/else checks everywhere. The eviction itself happens at runtime inside the LRU, but the code path stays clean and the dispatch overhead is virtually zero.
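To make the dispatch pattern concrete, here's a simplified, self-contained sketch. The `Bounded` variant below uses a FIFO queue as a stand-in for the lru crate's `LruCache` so it compiles with the standard library alone, and `ClusterId`/`Cluster` are pared down to toys.

```rust
use std::collections::{HashMap, VecDeque};

type ClusterId = u64;

#[allow(dead_code)]
struct Cluster {
    template: String,
}

// Sketch only: `Bounded` evicts FIFO here as a stand-in for the
// lru crate's true LRU eviction.
enum ClusterStorage {
    Unbounded(HashMap<ClusterId, Cluster>),
    Bounded {
        map: HashMap<ClusterId, Cluster>,
        order: VecDeque<ClusterId>,
        capacity: usize,
    },
}

impl ClusterStorage {
    // Call sites never check capacity; the variant decides.
    fn insert(&mut self, id: ClusterId, cluster: Cluster) {
        match self {
            ClusterStorage::Unbounded(map) => {
                map.insert(id, cluster);
            }
            ClusterStorage::Bounded { map, order, capacity } => {
                if map.len() >= *capacity && !map.contains_key(&id) {
                    // Evict the oldest cluster to stay under the cap.
                    if let Some(oldest) = order.pop_front() {
                        map.remove(&oldest);
                    }
                }
                order.push_back(id);
                map.insert(id, cluster);
            }
        }
    }

    fn len(&self) -> usize {
        match self {
            ClusterStorage::Unbounded(map) => map.len(),
            ClusterStorage::Bounded { map, .. } => map.len(),
        }
    }
}

fn main() {
    let mut storage = ClusterStorage::Bounded {
        map: HashMap::new(),
        order: VecDeque::new(),
        capacity: 2,
    };
    for id in 0..3 {
        storage.insert(id, Cluster { template: format!("pattern-{id}") });
    }
    println!("{}", storage.len()); // 2: the oldest cluster was evicted
}
```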
One thing that kept us up at night was the "First-N" trap. If you only process the first 10,000 logs and call it a day, you're biased toward the beginning of your time range. You'll completely miss error patterns that only show up at the end of the query window.
We implemented systematic sampling in our PatternAccumulator. As more logs arrive, we calculate a sampling interval that grows, effectively grabbing a representative sample across the entire result set.
```rust
// Widen the interval as more logs stream in, so the kept sample
// stays spread across the entire result set instead of its head.
let sampling_interval = if total_count > max_capacity {
    total_count / max_capacity
} else {
    1
};

if total_count % sampling_interval == 0 {
    process_log(log);
}
```
If you have 1,000,000 logs but a 10k cap, this logic ensures you're seeing a "slice" of the whole million. It gives you a representative skeleton of the data rather than just the first 1%.
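Wrapped in a tiny accumulator, the logic looks like this. This is a sketch with hypothetical names (the real `PatternAccumulator` does considerably more): `should_sample` tracks a running count and widens the interval as the stream grows.

```rust
/// Simplified sketch of systematic sampling (hypothetical type, not
/// the real PatternAccumulator): the sampling interval widens as the
/// stream grows, so kept logs stay spread across the whole result set.
struct SamplingGate {
    total_count: usize,
    max_capacity: usize,
}

impl SamplingGate {
    fn new(max_capacity: usize) -> Self {
        Self { total_count: 0, max_capacity }
    }

    /// Returns true if the current log should be fed to the parser.
    fn should_sample(&mut self) -> bool {
        self.total_count += 1;
        let interval = if self.total_count > self.max_capacity {
            self.total_count / self.max_capacity
        } else {
            1
        };
        self.total_count % interval == 0
    }
}

fn main() {
    let mut gate = SamplingGate::new(3);
    let kept: Vec<usize> = (1..=12).filter(|_| gate.should_sample()).collect();
    // Early, middle, and late logs all survive, unlike a first-N cut.
    println!("{kept:?}"); // [1, 2, 3, 4, 5, 6, 8, 9, 12]
}
```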
We benchmarked the Rust implementation against the original Python logic using mixed logs (HDFS, Apache, Syslog):
| Mode | Throughput |
|---|---|
| Python XDrain | ~9,000 logs/sec |
| Rust XDrain (Single Pass) | ~361,000 logs/sec |
| Rust XDrain (Homogeneous logs) | ~676,000 logs/sec |
| Rust XDrain (Voting/3 Shuffles) | ~60,000 logs/sec |
In single-pass mode, the Rust implementation is roughly 40x faster than the Python baseline. While the "Voting" mode is significantly slower than a single pass, it still clears 60k logs/sec. That turns a 20-second wait into a sub-second response, which is the difference between a tool being "useful" and being "annoying."
- **Type Safety is a Feature:** Using enums for configuration instead of boolean flags made the logic significantly more robust. The compiler simply won't let you forget an edge case.
- **LRU is Mandatory:** In the real world, log data is messy. Without bounded storage, a noisy stream will eat your RAM.
- **Strategy > Size:** How you sample matters more than how much you sample. Systematic sampling provides much better patterns than simple truncation.
We are already working on ways to improve the "Voting" performance further. We're currently looking into Parallel Voting using Rayon to run the shuffle passes concurrently. This should bring multi-pass speeds much closer to our single-pass benchmarks.
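As a rough sketch of the shape of that change (using `std::thread` here so the snippet is dependency-free; Rayon's parallel iterators would replace the manual spawns): each shuffle pass is independent, so only the final vote needs to be serialized.

```rust
use std::thread;

/// Hypothetical stand-in for one full shuffle + parse pass over the
/// logs; the real pass would run the XDrain parser on a shuffled copy.
fn run_pass(seed: u64) -> Vec<String> {
    vec![format!("template-from-seed-{seed}")]
}

fn main() {
    // Spawn the shuffle passes concurrently.
    let handles: Vec<_> = (0..3u64)
        .map(|seed| thread::spawn(move || run_pass(seed)))
        .collect();

    // Join them back; these per-pass results feed the vote.
    let per_pass: Vec<Vec<String>> = handles
        .into_iter()
        .map(|h| h.join().expect("pass panicked"))
        .collect();

    println!("{}", per_pass.len()); // 3 independent pass results to vote on
}
```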
If you're building observability tools, check it out—and feel free to drop a comment if you want to geek out over prefix tree implementations!
New to OpenObserve? Register for our Getting Started Workshop for a quick walkthrough.
Try OpenObserve: Download for self-hosting or sign up for OpenObserve Cloud with a 14-day free trial.