How Jidu Scaled Smart Car Tracing with OpenObserve

Nitya Timalsina
February 26, 2025
8 min read

Jidu, the technology company behind Jiyue’s autonomous driving systems, faced growing challenges in maintaining observability across their distributed systems. As telemetry data from their vehicles increased exponentially, their existing Elasticsearch-based observability stack was no longer sufficient.

The team identified three critical issues:

1. Incomplete Observability

Elasticsearch’s scalability limitations forced Jidu to sample only 10% of traces, leaving 90% of application behavior hidden. This lack of visibility made it difficult to identify performance bottlenecks and diagnose issues effectively.

2. Performance Bottlenecks

Despite allocating 24 CPU cores, 96GB of RAM, and 12TB of storage, Elasticsearch struggled with long-term queries and statistical analysis. Engineers frequently encountered memory errors and timeouts, slowing down debugging efforts.

3. High Operational Costs

Storing 1TB/day of sampled data resulted in significant infrastructure costs. Scaling to handle full-fidelity tracing would have been prohibitively expensive.

For Jidu, these challenges weren’t just technical—they directly impacted the stability and reliability of their autonomous driving systems, posing risks to customer satisfaction and operational efficiency.

The Solution: Why Jidu Chose OpenObserve

To address these challenges, Jidu migrated to OpenObserve, an open-source observability platform designed for high-performance monitoring at scale. OpenObserve offered a unified solution for logs, metrics, distributed tracing, and front-end monitoring—all while significantly reducing resource consumption and costs.

Key Benefits of OpenObserve for Jidu

  1. Full-Fidelity Tracing: OpenObserve enabled the collection of 100% of traces, providing complete visibility into application behavior without sampling.
  2. Efficient Resource Usage: OpenObserve’s architecture reduced CPU usage to less than 1 core per node and kept memory usage below 6GB—even during peak loads.
  3. Cost Savings Through Compression: Daily storage requirements dropped from 1TB to 0.3TB—despite a 10x increase in ingested data—thanks to advanced data compression techniques.
  4. Real-Time Insights: Engineers could now perform long-term queries and statistical analysis without timeouts or memory errors.

Jidu’s Observability Transformation with O2

Jidu’s transition to OpenObserve (O2) marked a significant shift in how the company managed observability for its autonomous driving systems. By addressing long-standing challenges and introducing innovative solutions, O2 transformed Jidu’s operations, making their systems more efficient, reliable, and scalable. Below is a detailed breakdown of the transformation.

Challenges Before O2 Adoption

Before adopting OpenObserve, Jidu faced several critical challenges that hindered their ability to maintain complete observability and optimize performance:

| Challenge | Details |
| --- | --- |
| Trace metric statistics on VMs | Jidu relied on virtual machines (VMs) for trace metric statistics. However, shared resources meant that a 2GB-memory agent couldn’t handle the collection of hundreds of thousands of metrics dynamically generated by spans. |
| High query concurrency | Automated scripts queried data every five minutes to monitor real-time errors in trace links, creating high query concurrency. This caused frequent timeouts for ordinary users’ queries, slowing down debugging and analysis efforts. |
| Limited trace querying | Developers could only query trace IDs within a specific time range. Without knowing the start and end times of traces, pinpointing issues was cumbersome and often required guesswork. |
| Fragmented debugging workflows | Debugging required jumping between trace detail pages and business logs manually. This disrupted workflows, slowed collaboration between teams, and made it harder to resolve issues quickly. |
| Lack of metric correlation | Engineers couldn’t correlate trace spans with container or node resource metrics directly from the trace detail page, making alarm responses inefficient and time-consuming. |
| Limited message queue insights | Trace links lacked detailed message queue information, making it difficult to analyze message content during debugging. |

These limitations increased operational complexity and posed risks to system stability and customer satisfaction.

Improvements After O2 Implementation

The adoption of OpenObserve introduced transformative changes across multiple dimensions of Jidu’s observability strategy. Here’s how O2 addressed each challenge:

1. Automated Trace Metric Statistics
With OpenObserve, metrics were automatically written in real-time during each operation. This eliminated the need for manual collection on VMs and ensured that all trace metrics were captured efficiently without resource contention.
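The idea of writing metrics at the moment each operation completes, rather than scraping them later from an agent, can be illustrated with a minimal sketch. Everything below is hypothetical (the class and operation names are invented for illustration), not OpenObserve’s actual implementation:

```python
from collections import defaultdict

class SpanMetrics:
    """Toy aggregator: updates per-operation metrics as each span ends,
    so no separate collection pass on a VM agent is needed."""

    def __init__(self):
        self.calls = defaultdict(int)          # operation -> span count
        self.duration_ms = defaultdict(float)  # operation -> total duration

    def on_span_end(self, operation: str, duration_ms: float) -> None:
        # Metrics are written in real time, at the moment the span completes.
        self.calls[operation] += 1
        self.duration_ms[operation] += duration_ms

    def avg_ms(self, operation: str) -> float:
        return self.duration_ms[operation] / self.calls[operation]

metrics = SpanMetrics()
metrics.on_span_end("GET /vehicle/status", 12.0)
metrics.on_span_end("GET /vehicle/status", 8.0)
# metrics.avg_ms("GET /vehicle/status") -> 10.0
```

Because the update happens inside the span-end path, the metric set grows with the operations actually seen, instead of a fixed-size agent trying to poll hundreds of thousands of series.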

2. Resource Grouping for Queries
O2 introduced a grouping feature that isolated resources between UI queries and automated tasks. This separation ensured that automated queries—such as those used for monitoring real-time errors—no longer affected ordinary users’ UI queries or caused timeouts.

3. Enhanced Trace Querying with Proxy Services
To streamline trace querying, Jidu implemented an external proxy service that recorded start and end times for trace IDs. Before querying O2, the proxy service retrieved these timestamps from the trace ID index service, enabling precise and efficient queries without relying on guesswork.
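The trace-ID index behind that proxy can be sketched in a few lines. This is a stdlib-only illustration of the lookup pattern (the class name, padding value, and in-memory store are assumptions, not Jidu’s actual service):

```python
class TraceIndex:
    """Hypothetical trace-ID index: records each trace's earliest start and
    latest end timestamp so backend queries can be scoped to an exact window."""

    def __init__(self):
        self._window = {}  # trace_id -> [start_ts, end_ts] (epoch seconds)

    def record_span(self, trace_id: str, start_ts: float, end_ts: float) -> None:
        w = self._window.setdefault(trace_id, [start_ts, end_ts])
        w[0] = min(w[0], start_ts)
        w[1] = max(w[1], end_ts)

    def query_window(self, trace_id: str, pad_s: float = 60.0):
        """Return (from, to) for the backend query, padded slightly so clock
        skew does not clip spans at the window edges."""
        start, end = self._window[trace_id]
        return start - pad_s, end + pad_s

idx = TraceIndex()
idx.record_span("abc123", 1000.0, 1002.0)
idx.record_span("abc123", 1001.0, 1005.0)
# idx.query_window("abc123") -> (940.0, 1065.0)
```

The proxy consults this index first, then issues a time-bounded query to O2, which is what removes the guesswork over start and end times.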

4. In-Line Log Display
Logs were integrated directly into the O2 trace detail page for each service span. Engineers could now view logs alongside trace details without navigating away from the interface, significantly improving debugging workflows.

5. Metric Correlation on Trace Pages
By adopting OTEL standards for collecting container IPs and node IPs in trace spans, Jidu enabled direct correlation between traces and resource metrics. Clicking on container or node tags within a trace span now displayed relevant resource usage metrics—such as CPU or memory usage—allowing engineers to respond to alarms faster.
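Once a span carries container and node identity, the jump from a span tag to a resource metric reduces to building a metrics query from those tags. A minimal sketch (the tag keys, metric name, and PromQL-style selector are illustrative assumptions, not Jidu’s exact schema):

```python
def container_metric_query(span_tags: dict, metric: str) -> str:
    """Given container/node tags on a span, build a PromQL-style selector
    for the matching resource metric. Attribute names are illustrative."""
    pod_ip = span_tags["container.ip"]
    node = span_tags["node.name"]
    return f'{metric}{{pod_ip="{pod_ip}",node="{node}"}}'

tags = {"container.ip": "10.0.3.7", "node.name": "worker-12"}
q = container_metric_query(tags, "container_memory_usage_bytes")
# q -> 'container_memory_usage_bytes{pod_ip="10.0.3.7",node="worker-12"}'
```

This is the essence of the click-through: the trace page already holds the label values a metrics query needs, so no manual cross-referencing is required.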

6. Message Queue Details Integration
O2 enhanced message queue visibility by including message queue IDs and cluster names in trace spans. Clicking on these fields displayed detailed message content, enabling teams to analyze message queues quickly during debugging sessions.
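Embedding queue context in a span comes down to attaching a few attributes when the message is produced or consumed. The `messaging.*` keys below follow OpenTelemetry semantic conventions; the cluster key and all values are illustrative assumptions:

```python
def message_span_attributes(system: str, queue: str,
                            message_id: str, cluster: str) -> dict:
    """Build span attributes that let a UI resolve a span back to the
    message it processed."""
    return {
        "messaging.system": system,           # e.g. "kafka"
        "messaging.destination.name": queue,  # topic/queue name
        "messaging.message.id": message_id,   # lets the UI fetch the payload
        "mq.cluster.name": cluster,           # custom tag, not a standard key
    }

attrs = message_span_attributes("kafka", "vehicle-telemetry",
                                "msg-42", "mq-prod-1")
```

With the message ID and cluster name on the span, clicking the field can resolve the actual message content during a debugging session.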

The following table highlights the improvements delivered by OpenObserve across key areas:

| Challenge | Before O2 Transformation | After O2 Transformation |
| --- | --- | --- |
| Trace metric statistics | Manual collection on VMs; frequent failures due to resource constraints | Automated real-time metric collection with no resource contention |
| Query concurrency management | High concurrency caused timeouts for ordinary users | Resource grouping ensured automated queries didn’t affect UI queries |
| Trace querying efficiency | Limited to querying trace IDs within guessed time ranges | Proxy service enabled precise querying using start and end times |
| Debugging workflow | Fragmented; required jumping between tools to view logs | Logs integrated directly into O2’s trace detail page |
| Metric correlation with traces | No direct correlation between traces and container/node metrics | Clicking on tags within traces displayed relevant resource metrics |
| Message queue insights | Lacked visibility into message queue details | Message queue fields allowed quick access to detailed message content |

OpenObserve Implementation Highlights

Jidu’s migration to OpenObserve was executed in three well-planned phases:

1. Deployment on Kubernetes

OpenObserve was deployed in shared Kubernetes clusters with a high-availability (HA) configuration to ensure reliability and scalability across Jidu’s distributed systems.
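OpenObserve publishes a Helm chart for cluster installs; an HA deployment like Jidu’s would pin down replica counts and resource limits roughly along these lines. The fragment below is only a sketch, and the key names are illustrative rather than taken from the current chart:

```yaml
# Hypothetical values.yaml sketch for an HA-style OpenObserve install;
# key names are illustrative -- check the chart's documented values.
replicaCount:
  ingester: 3
  querier: 2
  router: 2
resources:
  limits:
    cpu: "1"      # in line with the <1 core per node Jidu observed
    memory: 6Gi   # in line with the <6GB peak memory Jidu observed
```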

2. Data Migration

Existing telemetry data was migrated seamlessly using OpenObserve’s Elasticsearch-compatible APIs, ensuring continuity without data loss or downtime.
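The Elasticsearch-compatible API means the migration can reuse the `_bulk` NDJSON format that existing ES tooling already emits. A stdlib-only sketch of building such a payload (the target index name and document shape are illustrative; the exact ingestion endpoint path is deployment-specific):

```python
import json

def es_bulk_payload(index: str, docs: list) -> str:
    """Build an Elasticsearch-style _bulk body (NDJSON): an action line
    followed by a document line per record, ending with a newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = es_bulk_payload("traces", [{"trace_id": "abc", "duration_ms": 12}])
```

Because the request body is unchanged from what an ES cluster would accept, existing shippers can be repointed at the new backend without rewriting their output format.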

3. Pipeline Optimization

Data ingestion pipelines were reconfigured to handle higher throughput while leveraging OpenObserve’s efficient compression and indexing capabilities.
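For teams ingesting via the OpenTelemetry Collector, this kind of throughput tuning might look like the following sketch. The values are illustrative, not Jidu’s actual configuration, and the endpoint is a placeholder:

```yaml
# Sketch of an OpenTelemetry Collector pipeline tuned for throughput;
# batch sizes and endpoint are illustrative assumptions.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  batch:
    send_batch_size: 8192   # larger batches -> fewer, bigger writes
    timeout: 5s
exporters:
  otlphttp:
    endpoint: https://o2.example.internal/api/default
    compression: gzip       # compress on the wire as well as at rest
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```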

The Results

Switching to OpenObserve delivered transformative results for Jidu’s engineering team and overall operations:

| Metric | Before OpenObserve | After OpenObserve | Improvement |
| --- | --- | --- | --- |
| Trace coverage | 10% | 100% | 10x |
| Storage requirements | 1TB/day | 0.3TB/day | ~70% reduction |
| Query response time | Frequent timeouts | Sub-second | Resolved |
| Debugging time | Hours per issue | Minutes per issue | ~8x faster |

Key Business Outcomes

  • 100% Trace Fidelity: Full visibility into application behavior allowed engineers to proactively identify and resolve potential issues before they impacted customers.
  • 70% Lower Storage Requirements: Significant reductions in data storage freed up resources for other critical projects.
  • Improved Engineering Efficiency: Faster debugging times enabled engineers to focus on innovation rather than troubleshooting.
  • Enhanced Customer Experience: Improved application stability increased reliability and satisfaction among end users.

What Other Companies Can Learn from Jidu's Transformation

Jidu’s success story offers valuable lessons for companies managing high-throughput telemetry data in mission-critical environments like autonomous vehicles or IoT:

  1. Full Observability Is Affordable: Modern platforms like OpenObserve make it possible to capture all your trace data without breaking the bank.
  2. Efficient Resource Usage Saves Costs: Advanced compression techniques drastically reduce storage requirements while improving system performance.
  3. Unified Platforms Simplify Operations: Combining logs, metrics, traces, and front-end monitoring into one platform reduces complexity and improves workflows.

What Jidu Says About OpenObserve

“OpenObserve gave us complete visibility into our systems while cutting our costs by more than half—it’s a no-brainer for any company with growing observability needs.”

— Zhao Wei, APM Expert at Jidu

Ready to Transform Your Observability?

Whether you’re managing autonomous systems, IoT devices, or cloud-native applications, OpenObserve can help you achieve full-fidelity tracing without the high costs of legacy solutions. Here’s how you can get started:

  1. Calculate Your Potential Savings → Contact us for a custom quote and an assessment of your potential cost reductions.
  2. Learn More About Our Platform → Explore how OpenObserve works for high-throughput environments.
  3. Request a Demo or Try OpenObserve → See the platform in action with your own data.

About the Author

Nitya Timalsina


Nitya is a Developer Advocate at OpenObserve, with a diverse background in software development, technical consulting, and organizational leadership. Nitya is passionate about open-source technology, accessibility, and sustainable innovation.
