Performance Optimization

Introduction

Performance can be divided primarily into 2 areas:

WRITE (Ingestion)
READ (Log search, aggregation, dashboards)

Ingestion

OpenObserve stores data in parquet format compressed using zstd to reduce the storage requirements. In distributed mode OpenObserve stores data in Object storage solutions like s3 (and compatible stores. Azure blob is supported too) that are highly cost effective and scalable and allow you to reduce storage cost significantly.

OpenObserve does not do full-text indexing of data unlike Elasticsearch which is extremely compute intensive. This allows OpenObserve to be 5-10 times more performant for ingestion than Elasticsearch for the same amount of hardware.

There are things that you can do to optimize the ingestion performance of OpenObserve additionally:

Ensure you have enough CPU cores available for ingestion. OpenObserve uses all available CPU cores for ingestion. If you have 16 CPU cores available for ingestion, you can expect to ingest 4 times more data than if you had 4 CPU cores available for ingestion.
Using VRL functions at ingest time will use additional CPU during ingestion and can reduce your throughput. Impact can vary based on complexity of your functions. Test and plan accordingly.
CPU to memory ratio of 2x is a good ratio for ingester nodes. For example, if you have 4 CPU cores available for ingestion, you should have 16 GB of RAM available for ingestion. on AWS that means m6i or m7g instances (As of 2023) are recommended for ingestion.
On AWS we recommend m7g instances which are typically 40% faster and cost approximately 15% less than m6i instances.
Use SIMD version of containers/binaries for ingestion. They are able to leverage latest CPU instructions both on Intel and ARM CPUs and can help in calculating hashes for bloom filters faster. Additionally this improves aggregation performance significantly.
OpenObserve uses local disk on ingesters for temporary storage before batching and pushing data to object storage. This requires high IOPS. 3000 IOPS for most workloads should be good enough but test, measure and size it appropriately for your workload.

OpenObserve is designed to handle 100s of thousands of events per second per node. Mostly this will result in 7-30 MB/sec ingestion per vCPU core (varies on various factors).

Very High speed ingestion

If you have very high ingestion speed requirements (e.g. 100s of thousands of events per second), you can follow the below to improve ingestion performance:

ZO_FEATURE_PER_THREAD_LOCK=true : This setting will improve performance by 60-100%. Set this up if you are looking to gain greater than 12 MB/Second/Core (~12k records per second/core) ingestion speed. We have seen ingestion speed of 23 MB/Second/core in some scenarios. This is disabled by default. Improvement comes from the way we handle WAL (Write Ahead Log) in OpenObserve. By default we use a single WAL file per ingester node even if it has multiple CPU cores available, necessitating a lock on the WAL file per core as we use a thread per core for higher performance. This is a good default for most workloads. However if you have very high ingestion speed requirements you can enable this setting. This will result in one WAL file per CPU core, removing the lock contention. Enabling this setting where you are not ingesting at high speed will result in smaller WAL files resulting in inefficient storage operations and lower ingestion performance.
Use OTLP grpc for ingestion instead of http json or bulk interfaces. This should give another 60-100% performance boost during ingestion.
RUST_LOG=error : This will reduce the number of logs generated by OpenObserve and will improve ingestion performance by 2-5%.

Log search

OpenObserve does not do full-text indexing like Elasticsearch. This results in very high compression ratio of ingested data. Coupled with object storage this can give you ~140x lower storage cost. However, this also means that search performance for full text queries in absence of full-text indexes might suffer. However log data has some unique properties that can be leveraged to improve search performance significantly. OpenObserve uses following techniques to improve search performance:

Column pruning

OpenObserve uses columnar storage format (parquet) which allows it to read only the columns that are required for a query. This reduces the amount of data that needs to be read from disk and improves search performance. This technique is called column pruning. It reduces the amount of data that needs to be read from disk. You must switch to SQL query mode for this and specify only the columns that you want to be returned.

Predicate pushdown:

Standard Partitioning (KeyValue partitions)

Note: Use For low cardinality fields

OpenObserve uses a technique called predicate pushdown to further reduce the amount of data that needs to be read from disk. This is done by pushing down the filters to the storage layer. By default OpenObserve will partition data by org/stream/year/month/day/hour. So when searching, if you know the time range for which you are searching for data you should specify it and OpenObserve will skip data not following in date range and will search across much less data. This will improve search performance and will utilize predicate pushdown. You can also enable additional partitioning for fields on any stream by going to stream settings. Some good candidates for partition keys are host and kubernetes namespace. You can have multiple partition keys for a stream. You can then specify partition keys in your query. e.g. host='xyz' and kubernetes_namespace='abc'. This will improve search performance and will utilize predicate pushdown. DO NOT enable partitioning on all/many fields as it may result in many small underlying parquet files which will result in low compression, extremely poor search performance and high s3 storage costs . As a rule of thumb you would want the size of each stored parquet file to be above 5 MB. Order of partitions does not matter. You can partition by namespace, pod or pod, namespace.

This is how the underlying disk structure will look like if you have partitioned by k8s_container_name, k8s_namespace_name, k8s_pod_name:

data/stream/files/default/logs/optimized/2023/10/20/03/k8s_container_name=app
├── k8s_namespace_name=p1
│   ├── k8s_pod_name=k1-6cf68b7dfb-t4hbb
│   │   └── 7120990213769396224KU6f10.parquet
│   ├── k8s_pod_name=k2-5cb8dc848-jgztx
│   │   └── 7120989677196279808qFYgcG.parquet
│   ├── k8s_pod_name=k3-0
│   │   └── 7120984091553562624nAndCG.parquet
│   └── k8s_pod_name=k4-6f97cb4d86-tbb52
│       └── 71209900996256071685r1ZsZ.parquet
└── k8s_namespace_name=p2
    └── k8s_pod_name=k7-7c65b8fdd9-h48xs
        └── 7120981323325505536b8xiP6.parquet

You can now specify partition keys in your query. e.g. k8s_container_name=app and k8s_namespace_name='p1' and k8s_pod_name='k1-6cf68b7dfb-t4hbb' and match_all('error'). This will improve search performance and will utilize predicate pushdown. You should enable partitioning for low cardinality (Relatively not too many possible values for a field - e.g. namespace, host) data.

Hash partition

Note: Use For low cardinality fields

Standard KeyValue partitions are a good way to partition data. However, if data in a particular field is not divided evenly then you may end up with some partitions having more data than others.

e.g. For a namespace based partitioning if you have 10 million log records with following number of log entries for each namespace:

namespace01: 4 million
namespace02: 2 million
namespace03: 1 million
namespace04: 500 thousand
namespace05: 450 thousand
namespace06: 25 thousand
namespace07: 15 thousand
namespace08: 5 thousand
namespace09: 2 thousand
namespace10: 1 thousand
namespace11: 500
namespace12: 200
namespace13: 100
namespace14: 100
namespace15: 100
namespace16: 50
namespace17: 50
namespace18: 50
namespace19: 50
namespace20: 50
namespace21: 50
namespace22: 50
namespace23: 50
namespace24: 30
namespace25: 20

If you set KeyValue partition on the namespace field, you will have 25 partitions. However, the data is not evenly distributed across these partitions. For example, one partition may contain 4 million records while another may have only 2 million. This uneven distribution can lead to some partitions being slow to search while others are fast. Additionally, smaller partitions may result in small parquet files with low compression and high storage costs in S3.

In such cases you should enable hash partitioning. Hash partitioning is a way to divide data across partitions based on defined number of hash buckets. You can enable hash partitioning for a field by going to stream settings.

For the above scenario, you can enable hash partitioning on namespace field with 8 buckets. This will divide the data across 8 partitions instead of 25. Distribution of records in each partition will be a lot less skewed than in case of standard partitioning resulting in better performance for majority of the queries based on the values in this field and will result in better compression and lower s3 storage costs.

You can specify the number of buckets (8, 16, 32, 64, 128) in the index in stream setting when setting up hash partitioning for a particular field.

Time range partition

Note: Enabled by default and cannot be disabled

OpenObserve partitions all data by time range by default in addition to any other partitions that you may have defined. It always makes sense to specify the shortest time range to search for. e.g. if you know that you are looking for data for last 15 minutes, you should specify that in your query by selecting it from the top right corner. This will improve search performance and will utilize predicate pushdown.

Bloom filter (available starting v0.8.0) (For high cardinality fields)

Note: Use For high cardinality fields

A bloom filter is a space efficient probabilistic data structure that allow you to check if a value exists in a set. It solves proverbial needle in a haystack problem. OpenObserve uses bloom filters to check if a value exists in a column. This allows OpenObserve to skip reading the data from disk if the value does not exist in the column. This improves search performance by reducing search space. You must specify bloom filter for the specific fields that you want to search. Fields that are well suited for bloom filter are of very high cardinality .e.g. UUID, request_id, trace_id, device_id, etc. You can specify bloom filter for a field by going to stream settings. You can specify multiple fields for bloom filter. e.g. request_id and trace_id. You can then use the fields in your query that will utilize bloom filter. e.g. request_id='abc' and trace_id='xyz'. Enabling bloom filter on a field with low cardinality will not result in any performance improvement.

Bloom filters have negligible storage overhead.

Full text search

Log search involves full text search. When you try to do a full text search it essentially does an equivalent of grep and scans all the data in particular fields that have full text search enabled. This is facilitated by match_all function. This can be slow for large amount of data. In order to be able to do full text search efficiently you should reduce the search space where you are doing full text search. There are 2 things you can do to improve full text search:

Do not use match_all directly on full data set, but always use it in combination with one or more filters which can themselves be optimized by partitions or bloom filters. e.g. host='host1' and match_all('error') or k8s_namespace_name='ns1' and match_all('error') or bank_account_number='653456-54654-65' and match_all('error'). In all of these examples using the filter reduces search space for match_all. Additionally if host and k8s_namespace_name fields are partitioned then you have reduced search space very well and will gain the improvements in full text search. bank_account_number, request_id, trace_id, device_id are good candidates for bloom filter and should be used together with match_all to improve full text search performance.
Enable full text search only on the fields that you need. e.g. body, log, message etc. Fields like hostname, ip address, etc. are not good candidates for full text search and you should not enable full text search on these fields. You can enable full text search on a field by going to stream settings. You can specify multiple fields for full text search. e.g. body and message. You can then use the fields in your query that will utilize full text search. e.g. host='host1' and match_all('error').

Inverted Index (available starting v0.10.0)

Above mentioned partitioning schemes and bloom filters are good for fields where you are doing equality based searches. e.g. request_id='abc'. For full text search in fields that contain longer log lines, OpenObserve in its earlier releases relied on brute force search (how grep works) which works well for most of the scenarios. However, for very large data sets this can be slow. You can enable inverted index to improve full text search performance for such fields. Do not enable inverted index for fields that are not used for full text search but are used for equality based searches. Bloom filters and hash partitions are better suited for equality based searches.

Inverted index is a data structure that maps content to its location in a document. It is used to optimize full text search. OpenObserve uses inverted index to optimize full text search. You must specify inverted index for the specific fields that you want to search. Fields that are well suited for inverted index have long amount of text and has many words. e.g. body, log, message, etc.

e.g. log line:

setup.go:202] cert-manager/issuers "msg"="skipping re-verifying ACME account as cached registration details look sufficient"

and are used for full text search.

You can specify inverted index for a field by going to stream settings. You can specify multiple fields for inverted index. e.g. body and message. You can then use the fields in your query that will utilize inverted index. e.g. match_all("searchterm").

Based on our tests, inverted index can improve full text search performance by up to 1000x. Enabling inverted index will lead to approximately 25% storage overhead for the stream and may vary based on data entropy.

In memory caching

OpenObserve can use RAM to cache the data that is read from disk/s3. This reduces the amount of data that needs to be read from disk during search and improves search performance. OpenObserve by default will try to use all the available RAM to improve performance. This can also mean high memory utilization. You can use following environment variables to configure the cache:

Environment Variable	Value	Description
ZO_MEMORY_CACHE_MAX_SIZE	4096	This will limit the query cache to 4GB (value in MB)
ZO_MEMORY_CACHE_DATAFUSION_MAX_SIZE	4096	This will limit the query engine memory pool to 4GB (value in MB)

You want to have at least 8 GB of memory with the above settings.

On disk caching

OpenObserve can use disk on queriers to cache the data that is read from s3. This reduces the amount of data that needs to be read from s3 during search and improves search performance. You can use following environment variables to configure the cache:

Environment Variable	Default Value	Description
ZO_DISK_CACHE_ENABLED	true	enable on-disk caching for files. Latest files are cached for accelerating queries. when the memory cache is not enough OpenObserve will try to cache in local disk, you can consider the memory cache to be first level cache and disk cache to be second level.
ZO_DISK_CACHE_MAX_SIZE	-	default 50% of the total free disk, one can set it to desired amount unit: MB
ZO_DISK_CACHE_SKIP_SIZE	-	default 80% of the total disk cache size, A query will skip disk cache if it need more than this value. one can set it to desired amount unit: MB
ZO_DISK_CACHE_RELEASE_SIZE	-	default drop 1% entries from in-disk cache as cache is full, one can set it to desired amount unit: MB

RAM is generally much more expensive than disk and you may not have enough RAM to cache all the data. In this case you can use fast NVMe SSDs to cache the data. i4 and i3 instances on AWS are good candidates for this.

Pre-cache Files for Faster Query Execution

Set ZO_CACHE_LATEST_FILES_ENABLED to true to improve query performance.

How It Works:

File Creation: The ingester or compactor node (the source node) creates a new .parquet or .ttv file.
Event Notification: The system performs a hashing operation to select one of the querier nodes (the receiving node) to receive a notification about the new file.
File Download and Caching: The querier node downloads the file from S3 and stores it in its local cache.
Subsequent Queries: The next time a query for this file is made, the system retrieves the file from the local cache on the querier node, rather than fetching it again from S3. Note:
When disabled (default): Files are cached only when queried. More storage-efficient, but may cause slower query times if the files are not already cached.
When enabled: Files are pre-cached when created, making queries faster, but may waste storage if the files are never needed for queries.

Set ZO_CACHE_LATEST_FILES_DOWNLOAD_FROM_NODE to true to optimize file downloads and reduce S3 load.

How It Works:

File Request: When a querier node (the receiving node) needs a file, it first checks if the ingester node (the source node) has the file cached.
Cache Check:
- If the ingester node has the file cached, the querier node will download it from the ingester node.
- If the ingester node does not have the file cached, the querier node will then fetch the file from S3.

Result caching

Result caching is incredibly powerful mechanism to speed up repetitive or similar queries.

Let's take an example.

You have built a dashboard with x-axis as timestamp and y-axis as count of requests. You and your colleagues check this dashboard every day. Below is the output table (It could be anything - table, histogram, line chart, etc) :

Date	Request Count
1-Jan-2025	10
2-Jan-2025	15
3-Jan-2025	25
4-Jan-2025	20
5-Jan-2025	17
6-Jan-2025	18

If you started ingesting data on Jan 1st and ran query for 7 days, you would process all the raw data and calculate the output. On Jan 2nd you want to get the data again, you could by default calculate data for Jan 1st and Jan 2nd, however when result caching is enabled, you would have the result for Jan 1st cached. This will prevent processing of raw data for Jan 1st and you would get results much faster. If your colleagues try to view the same dashboard after you have visited the dashboard then it will be pretty instantaneous for them as the result has been cached already.

What happens when you visited the dashboard on Jan 3rd at 10:00 AM and your colleague visited the dashboard at 11:00 AM. Time range for the dashbboard is 2 days. In this case:

you would get data from Jan 1st 10:00 AM to Jan 3rd 10:00 AM
your colleague would get data from Jan 1st 11:00 AM to Jan 3rd 11:00 AM

Result caching, while fetching the data for your colleague will discard the cached data from Jan 1st 10:00 AM to Jan 1st 11:00 AM. Result caching will also calculate data from Jan 3rd 10:00 AM to Jan 3rd 11:00 AM and add it to the totals for your colleague to get the right results. OpenObserve had to crunch data only for 1 hour for your colleague instead of 48 hours in order to display the data. This results in huge performance gains.

Result caching is implemented for 2 kinds of queries:

Time series based queries
1. e.g.
```
select histogram(_timestamp) as x_axis1, count(body_status) from default group by x_axis1
```
2. In these kind of queries x-axis is always time. This allows for result caching based on time and discard old cached data that is not required for the new query. e.g. You have 3 days of cached result and then you run another query for just 1 day. In this case 2 days of cached result will be discarded and only 1 day of cached data is used which provides performance gain. This is also dependent on the histogram interval of the query. If the histogram interval of query was 5 minutes earlier in query 1 and you chose (or auto selected a different interval) then result cache will not get hit. You might experience this when seraching across wide raneg fo queries. e.g. 30 minutes query might have 10 seconds auto interval set, while 1 month duration query might have 6 hours of auto interval set.
Aggregation based queries with no time series - To be implemented
1. e.g.
```
select count(*) from default group by body_status
```
2. To be implemented

To enable result caching, set these environment variables:

ZO_RESULT_CACHE_ENABLED: "true" # Enable result cache for query results
ZO_USE_MULTIPLE_RESULT_CACHE: "true" # Enable to use mulple result caches for query results

Query partitioning

Query performance UX is not always about delivering query results faster. Imagine if you woudn't have to wait for the results of the query but could keep getting the results of the query incrementally as they are processed. This would be similar (but slighly better) to knowing the status of where your uber driver is and how long s/he is going to take to reach you even if it takes the same time without knowing it.

In the case of a query result on log search page or getting the results on dashboard panel, OpenObserve can partition the query to get results incrementally.

e.g. A query for 1 day may be broken into 4 queries of 6 hours each (UI would automatically do this for you) and you would see the results of first 6 hours and then incrementally get all the results. All the requests are made incrementally by the UI. By default UI uses AJAX requests for each qyery partition.

While query partitioning can improve user experience greatly, it can also reduce the overall speed of getting the result. e.g. One day query was broken into 48 individual queries. Now this query without partition may have gitten completed in 6 seconds. Howver making 48 separate HTTP requests sequentially may take 24 seconds to get the results (HTTP requests have overhead). In order to tackle this you can enable websockets. You can enable websockets using:

ZO_WEBSOCKET_ENABLED: "true"

Enabling websockets would also require you to setup more things if you are using a reverse proxy like nginx.

Official helm chart has all of this setup for you so you don't have to worry about it. However if you are setting it up yourself or using another environment make sure that these (or it's equivalents) are configured:

Add nginx annotations:

    nginx.ingress.kubernetes.io/proxy-http-version: "1.1" # Enable HTTP/1.1 for WebSockets
    nginx.ingress.kubernetes.io/enable-websocket: "true"
    # nginx.ingress.kubernetes.io/connection-proxy-header: keep-alive # disable keep alive to use websockets
    nginx.ingress.kubernetes.io/proxy-set-headers: |
      Upgrade $http_upgrade;
      Connection "Upgrade";

Websockets as of 0.14.1 is an experimental feature and you must enable it from the UI as well. Settings > General settings > Enable Websocket Search

Result caching + Query partition + Websockets = Huge performance gains and great UX.

Large Number of Fields

Each log entry may contain anywhere from a minimum of 1 field (e.g. _timestamp) up to thousands of fields. By default, the Logs UI runs a query like:

SELECT * FROM stream1

The table below shows the approximate storage required for each record as the number of fields increases:

Number of Fields	Size per Record	Size of 100 Records	Size of 1 Million Records
100	3 KB	0.29 MB	2.86 GB
1,000	30 KB	2.92 MB	28.6 GB
10,000	300 KB	29.2 MB	286 GB

Common causes for large number of fields

Logs from multiple applications of diffferent types being ingested in the same stream. If app1 has 50 fields and app2 has 75 fields with no fields that have the same name in both apps and logs from both applications are being ingested in the same stream then the stream will have 125 fields. Ideally you should send logs from both of these application to 2 different streams. You can do this by configuring your log forwarder (fluentbit, vector, otel-collector, etc) or you can use OpenObserve pipelines to do it.
Large number of nested fields in the JSON record that is being sent. OpenObserve flattens the JSON that it receives and if there are too many levels then the number of fields can be huge. You can avoid having large number of fields due to high nested levels is by setting the ZO_INGEST_FLATTEN_LEVEL environment variable to say 2 or 3 .

OpenObserve provides various optimizations for querying. However, having a large number of fields in your log entries can lead to the following issues:

1. Increased Storage Requirements

Solution: Reduce the number of fields. Typically, for troubleshooting, you don’t need hundreds of fields per record. Aim for fewer than 50 fields to keep data sizes manageable. You can use pipelines to reduce and filter out unnecessary fields.

2. Slower Queries and solutions

Each query that uses SELECT * must process all fields, which can be slow.

Solutions:

2.1 Control how many fields you are sending to OpenObserve (Recommended)

Try not to send more than 100 fields to OpenObserve. For most use cases humans cannot process more than 100 fields anyway.

2.2 Delete the fields that you don't need from the stream setting

Deleting the fields (in stream settings in OpenObserve) that you don't need will allow for faster queries. This may be desirabe in scenarios where at some point in time you had a large number of fields as you were pulling logs from multiple applications but not anymore.

2.3 Quick Mode

Instead of using:

SELECT * FROM stream1

Try specifying only the required fields:

SELECT field1, field2 FROM stream1

This allows OpenObserve to perform column projection, making queries faster. You can manually write queries this way or enable Quick Mode in the Logs UI and select only the needed fields. However, this approach must be done manually for each query, and users may forget.

2.4 User-Defined Schema (UDS)

By enabling User-Defined Schemas (via ZO_ALLOW_USER_DEFINED_SCHEMAS=true), you can define a set of important fields for a stream in its settings.

Example: If you have 5,000 fields and select only 50 as part of the UDS, queries will now only consider these 50 fields directly, greatly improving performance. The remaining 4,950 fields will be combined into a single _raw field (a string), which won’t be searchable.

If you later need one of the fields from the _raw data to be searchable, simply add it to the UDS in the stream’s settings. After doing so, this field will become searchable going forward.