How Jidu Scaled Smart Car Tracing with OpenObserve

Jidu, the technology company behind Jiyue’s autonomous driving systems, faced growing challenges in maintaining observability across their distributed systems. As telemetry data from their vehicles increased exponentially, their existing Elasticsearch-based observability stack was no longer sufficient.
The team identified three critical issues:
Elasticsearch’s scalability limitations forced Jidu to sample only 10% of traces, leaving 90% of application behavior hidden. This lack of visibility made it difficult to identify performance bottlenecks and diagnose issues effectively.
Despite allocating 24 CPU cores, 96GB of RAM, and 12TB of storage, Elasticsearch struggled with long-term queries and statistical analysis. Engineers frequently encountered memory errors and timeouts, slowing down debugging efforts.
Storing 1TB/day of sampled data resulted in significant infrastructure costs. Scaling to handle full-fidelity tracing would have been prohibitively expensive.
For Jidu, these challenges weren’t just technical—they directly impacted the stability and reliability of their autonomous driving systems, posing risks to customer satisfaction and operational efficiency.
To address these challenges, Jidu migrated to OpenObserve, an open-source observability platform designed for high-performance monitoring at scale. OpenObserve offered a unified solution for logs, metrics, distributed tracing, and front-end monitoring—all while significantly reducing resource consumption and costs.
Jidu’s transition to OpenObserve (O2) marked a major shift in how the company managed observability for its autonomous driving systems. By addressing long-standing challenges, O2 made Jidu’s operations more efficient, reliable, and scalable. Below is a detailed breakdown of the transformation.
Before adopting OpenObserve, Jidu faced several critical challenges that hindered their ability to maintain complete observability and optimize performance:
| Challenge | Details |
| --- | --- |
| Trace Metric Statistics on VMs | Jidu relied on virtual machines (VMs) for trace metric statistics. Because the VMs shared resources, an agent limited to 2GB of memory could not collect the hundreds of thousands of metrics dynamically generated by spans. |
| High Query Concurrency | Automated scripts queried data every five minutes to monitor real-time errors across trace links, creating high query concurrency. This caused frequent timeouts for ordinary users’ queries, slowing down debugging and analysis. |
| Limited Trace Querying | Developers could only query trace IDs within a specific time range. Without knowing a trace’s start and end times, pinpointing issues was cumbersome and often required guesswork. |
| Fragmented Debugging Workflows | Debugging required manually jumping between trace detail pages and business logs. This disrupted workflows, slowed collaboration between teams, and made it harder to resolve issues quickly. |
| Lack of Metric Correlation | Engineers couldn’t correlate trace spans with container or node resource metrics directly from the trace detail page, making alarm responses inefficient and time-consuming. |
| Limited Message Queue Insights | Trace links lacked detailed message queue information, making it difficult to analyze message content during debugging. |
These limitations increased operational complexity and posed risks to system stability and customer satisfaction.
The adoption of OpenObserve introduced transformative changes across multiple dimensions of Jidu’s observability strategy. Here’s how O2 addressed each challenge:
1. Automated Trace Metric Statistics
With OpenObserve, trace metrics were written automatically in real time as each operation executed. This eliminated manual collection on VMs and ensured that all trace metrics were captured efficiently, without resource contention.
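For teams following a similar pattern, a minimal sketch of what “metrics written at operation time” can look like is shown below, using the OpenTelemetry Python API. It assumes a tracer and meter provider are already configured to export OTLP data to OpenObserve; the service, metric, and endpoint names are illustrative, not Jidu’s actual instrumentation.

```python
import time

from opentelemetry import metrics, trace

# Assumes a TracerProvider and MeterProvider are configured elsewhere to
# export OTLP data to OpenObserve; without that, the API falls back to
# no-op implementations. Names below are illustrative.
tracer = trace.get_tracer("checkout-service")
meter = metrics.get_meter("checkout-service")

# RED-style metrics emitted at write time, alongside each span, instead of
# being scraped afterwards by a memory-constrained agent on a VM.
request_count = meter.create_counter(
    "rpc.requests", unit="1", description="Completed operations")
request_latency = meter.create_histogram(
    "rpc.duration", unit="ms", description="Operation latency")

def handle_request(endpoint: str) -> None:
    start = time.monotonic()
    with tracer.start_as_current_span(endpoint) as span:
        span.set_attribute("endpoint", endpoint)
        # ... business logic ...
    elapsed_ms = (time.monotonic() - start) * 1000
    request_count.add(1, {"endpoint": endpoint})
    request_latency.record(elapsed_ms, {"endpoint": endpoint})

handle_request("/api/v1/route-plan")
```

Because the counter and histogram are updated inside the request path, the statistics exist the moment the span ends, with no separate collection agent involved.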
2. Resource Grouping for Queries
O2 introduced a grouping feature that isolated resources between UI queries and automated tasks. This separation ensured that automated queries—such as those used for monitoring real-time errors—no longer affected ordinary users’ UI queries or caused timeouts.
3. Enhanced Trace Querying with Proxy Services
To streamline trace querying, Jidu implemented an external proxy service that recorded start and end times for trace IDs. Before querying O2, the proxy service retrieved these timestamps from the trace ID index service, enabling precise and efficient queries without relying on guesswork.
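The write-up above does not include Jidu’s proxy code, but the flow can be sketched roughly as follows. The index-service URL, organization, stream name, credentials, and the shape of OpenObserve’s `_search` payload are assumptions made for illustration; check the current OpenObserve API documentation before relying on them.

```python
import requests

# Hypothetical sketch of the proxy flow described above. Endpoints, field
# names, and credentials are illustrative, not Jidu's actual implementation.
INDEX_SERVICE = "http://trace-id-index.internal"   # stores trace_id -> time window
O2_BASE = "http://openobserve.internal:5080"
ORG, STREAM = "default", "default"

def query_trace(trace_id: str, auth: tuple) -> dict:
    # 1. Resolve the trace's start/end timestamps from the index service.
    window = requests.get(f"{INDEX_SERVICE}/traces/{trace_id}", timeout=5).json()

    # 2. Query OpenObserve with that precise window instead of guessing one.
    body = {
        "query": {
            "sql": f"SELECT * FROM \"{STREAM}\" WHERE trace_id = '{trace_id}'",
            "start_time": window["start_time"],   # microseconds since epoch (assumed)
            "end_time": window["end_time"],
            "from": 0,
            "size": 1000,
        }
    }
    resp = requests.post(f"{O2_BASE}/api/{ORG}/_search",
                         json=body, auth=auth, timeout=30)
    resp.raise_for_status()
    return resp.json()
```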
4. In-Line Log Display
Logs were integrated directly into the O2 trace detail page for each service span. Engineers could now view logs alongside trace details without navigating away from the interface, significantly improving debugging workflows.
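One common way to make such in-line display possible is to stamp each log record with the active trace and span IDs so the backend can join logs to spans. The sketch below uses OpenTelemetry’s Python logging instrumentation for that purpose; it illustrates the general technique rather than the exact mechanism Jidu uses.

```python
import logging

from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Injects otelTraceID / otelSpanID into every log record's format so the log
# stream can be joined to spans on the trace detail page. Requires the
# opentelemetry-instrumentation-logging package; service names are illustrative.
LoggingInstrumentor().instrument(set_logging_format=True)

tracer = trace.get_tracer("payment-service")
log = logging.getLogger("payment-service")

with tracer.start_as_current_span("charge"):
    # This record now carries the active trace and span IDs, so a backend
    # can surface it under the matching span in the trace view.
    log.warning("charge retried after gateway timeout")
```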
5. Metric Correlation on Trace Pages
By adopting OTEL standards for collecting container IPs and node IPs in trace spans, Jidu enabled direct correlation between traces and resource metrics. Clicking on container or node tags within a trace span now displayed relevant resource usage metrics—such as CPU or memory usage—allowing engineers to respond to alarms faster.
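A minimal sketch of attaching that identity to spans with the OpenTelemetry Python SDK is shown below. The attribute keys follow OTel semantic conventions where they exist, and the hard-coded IPs stand in for values that would normally come from the Kubernetes downward API; treat both as illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Attach container/node identity as resource attributes so every exported span
# carries the keys a trace page can use to look up CPU and memory metrics.
# Values are placeholders; in Kubernetes they would typically be injected via
# the downward API (e.g. valueFrom: status.podIP).
resource = Resource.create({
    "service.name": "perception-gateway",
    "k8s.pod.ip": "10.42.3.17",
    "k8s.node.name": "node-07",
    "k8s.node.ip": "192.168.1.7",   # assumed custom key if not in semconv
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```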
6. Message Queue Details Integration
O2 enhanced message queue visibility by including message queue IDs and cluster names in trace spans. Clicking on these fields displayed detailed message content, enabling teams to analyze message queues quickly during debugging sessions.
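In OpenTelemetry terms, this amounts to recording messaging attributes on the consumer (or producer) span, roughly as sketched below. The `messaging.*` keys follow OTel messaging conventions where available; the cluster-name key and the `message` object are assumptions made for the example.

```python
from opentelemetry import trace

tracer = trace.get_tracer("telemetry-consumer")

def consume(message: dict) -> None:
    # Record queue identity on the span so a trace page can link straight to
    # the message content. The message fields here are illustrative.
    with tracer.start_as_current_span("vehicle-events consume") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination.name", "vehicle-events")
        span.set_attribute("messaging.message.id", message["id"])
        span.set_attribute("messaging.cluster.name", message["cluster"])  # assumed key
        # ... process the message ...
```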
The following table highlights the improvements delivered by OpenObserve across key areas:
| Challenge | Before O2 Transformation | After O2 Transformation |
| --- | --- | --- |
| Trace Metric Statistics | Manual collection on VMs; frequent failures due to resource constraints | Automated real-time metric collection with no resource contention |
| Query Concurrency Management | High concurrency caused timeouts for ordinary users | Resource grouping ensured automated queries didn’t affect UI queries |
| Trace Querying Efficiency | Limited to querying trace IDs within guessed time ranges | Proxy service enabled precise querying using start and end times |
| Debugging Workflow | Fragmented; required jumping between tools to view logs | Logs integrated directly into O2’s trace detail page |
| Metric Correlation with Traces | No direct correlation between traces and container/node metrics | Clicking tags within traces displayed relevant resource metrics |
| Message Queue Insights | Lacked visibility into message queue details | Message queue fields allowed quick access to detailed message content |
Jidu’s migration to OpenObserve was executed in three well-planned phases:
1. Deployment: OpenObserve was deployed in shared Kubernetes clusters with a high-availability (HA) configuration to ensure reliability and scalability across Jidu’s distributed systems.
2. Data migration: Existing telemetry data was migrated using OpenObserve’s Elasticsearch-compatible APIs, preserving continuity without data loss or downtime (a rough ingestion sketch follows this list).
3. Pipeline reconfiguration: Data ingestion pipelines were reconfigured to handle higher throughput while leveraging OpenObserve’s efficient compression and indexing capabilities.
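As a rough illustration of the migration step, the sketch below replays existing documents into OpenObserve through its Elasticsearch-compatible bulk endpoint. The URL, organization, stream, credentials, and document shape are placeholders, and the endpoint details should be verified against the OpenObserve documentation rather than taken from this example.

```python
import json

import requests

# Hedged sketch of bulk ingestion via the Elasticsearch-compatible API.
# All connection details below are placeholders.
O2_BULK_URL = "http://openobserve.internal:5080/api/default/_bulk"
AUTH = ("ingest@example.com", "password")          # placeholder credentials

def bulk_ingest(docs: list, stream: str = "migrated_traces") -> None:
    lines = []
    for doc in docs:
        # Elasticsearch bulk format: an action line followed by the document.
        lines.append(json.dumps({"index": {"_index": stream}}))
        lines.append(json.dumps(doc))
    payload = "\n".join(lines) + "\n"
    resp = requests.post(O2_BULK_URL, data=payload, auth=AUTH,
                         headers={"Content-Type": "application/x-ndjson"},
                         timeout=60)
    resp.raise_for_status()

bulk_ingest([{"service": "planner", "duration_ms": 12.4}])
```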
Switching to OpenObserve delivered transformative results for Jidu’s engineering team and overall operations:
| Metric | Before OpenObserve | After OpenObserve | Improvement |
| --- | --- | --- | --- |
| Trace Coverage | 10% | 100% | 10x |
| Storage Requirements | 1TB/day | 0.3TB/day | ~70% reduction |
| Query Response Time | Frequent timeouts | Sub-second | Resolved |
| Debugging Time | Hours per issue | Minutes per issue | ~8x faster |
Jidu’s success story offers valuable lessons for companies managing high-throughput telemetry data in mission-critical environments like autonomous vehicles or IoT:
> “OpenObserve gave us complete visibility into our systems while cutting our costs by more than half—it’s a no-brainer for any company with growing observability needs.”
>
> — Zhao Wei, APM Expert at Jidu
Whether you’re managing autonomous systems, IoT devices, or cloud-native applications, OpenObserve can help you achieve full-fidelity tracing without the high costs of legacy solutions. Here’s how you can get started: