Tantivy Index
This document explains Tantivy indexing in OpenObserve, the types of indexes it builds, how to use the correct query patterns for both single-stream and multi-stream queries, and how to verify and configure indexing.
Tantivy indexing is an open-source feature in OpenObserve.
What is Tantivy?
Tantivy is the inverted index library used in OpenObserve to accelerate searches. An inverted index keeps a map of values or tokens and the row IDs of the records that contain them. When a user searches for a value, the query can use this index to go directly to the matching rows instead of scanning every log record.
Index types
Tantivy builds two kinds of indexes in OpenObserve:
Full-text index
For fields such as body or message that contain sentences or long text. The field is split into tokens, and each token is mapped to the records that contain it.
Example log records:
- Row 1: body = "POST /api/metrics error"
- Row 2: body = "GET /health ok"
- Row 3: body = "error connecting to database"
The log body POST /api/metrics error is stored as tokens POST, api, metrics, error. A search for error looks up that token in the index and immediately finds the matching records.
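As a rough illustration of this idea, the following Python sketch builds a token-to-rows map for the three example records. The tokenizer (a regular expression that splits on non-alphanumeric characters) and the helper names are assumptions made for this sketch; they are not Tantivy's actual tokenizer or API.

```python
import re
from collections import defaultdict

# Example records from above, keyed by row number.
RECORDS = {
    1: "POST /api/metrics error",
    2: "GET /health ok",
    3: "error connecting to database",
}


def tokenize(text):
    """Split text into alphanumeric tokens (a simplification, not Tantivy's tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_full_text_index(records):
    """Map each token to the sorted list of row numbers that contain it."""
    index = defaultdict(set)
    for row_id, body in records.items():
        for token in tokenize(body):
            index[token].add(row_id)
    return {token: sorted(rows) for token, rows in index.items()}


full_text_index = build_full_text_index(RECORDS)
print(tokenize(RECORDS[1]))              # ['POST', 'api', 'metrics', 'error']
print(full_text_index.get("error", []))  # [1, 3]
```

The lookup for error touches only one dictionary entry, which is the property that lets an inverted index skip unrelated records.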
Secondary index
For fields that represent a single exact value. For example, kubernetes_namespace_name. In this case, the entire field value is treated as one token and indexed.
Example log records:
- Row 1: kubernetes_namespace_name = ingress-nginx
- Row 2: kubernetes_namespace_name = ziox
- Row 3: kubernetes_namespace_name = ingress-nginx
- Row 4: kubernetes_namespace_name = cert-manager
For kubernetes_namespace_name, the index might look like:
- ingress-nginx -> [Row 1, Row 3]
- ziox -> [Row 2]
- cert-manager -> [Row 4]
A query for kubernetes_namespace_name = 'ingress-nginx' retrieves Row 1 and Row 3 directly, without scanning unrelated records. By keeping these indexes, Tantivy avoids full scans across millions or billions of records, so indexed queries can return in milliseconds rather than seconds.
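The secondary index for the four example rows can be sketched in Python as follows. This is an illustration under simple assumptions (a plain dictionary and hypothetical helper names), not the way Tantivy stores data internally.

```python
from collections import defaultdict

# Example rows from above: row number -> kubernetes_namespace_name.
NAMESPACES = {
    1: "ingress-nginx",
    2: "ziox",
    3: "ingress-nginx",
    4: "cert-manager",
}


def build_secondary_index(values):
    """Treat each whole field value as one token and map it to its row numbers."""
    index = defaultdict(list)
    for row_id in sorted(values):
        index[values[row_id]].append(row_id)
    return dict(index)


def lookup_in(index, wanted):
    """Emulate `field IN (...)` by merging the row lists of each wanted value."""
    rows = set()
    for value in wanted:
        rows.update(index.get(value, []))
    return sorted(rows)


secondary_index = build_secondary_index(NAMESPACES)
print(secondary_index["ingress-nginx"])             # [1, 3]
print(lookup_in(secondary_index, ["ziox", "cert-manager"]))  # [2, 4]
```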
Configure environment variable
Enable Tantivy indexing
To enable Tantivy indexing, configure the following environment variable:
| Environment Variable | Description | Default Value |
|---|---|---|
| ZO_ENABLE_INVERTED_INDEX | Enables or disables Tantivy indexing | true |
Enable Tantivy result cache (optional)
The Tantivy result cache feature enhances search performance by storing index query results. It is disabled by default. To enable and configure the cache, set the following environment variables:
| Environment Variable | Description | Default Value |
|---|---|---|
| ZO_INVERTED_INDEX_RESULT_CACHE_ENABLED | Enables or disables the Tantivy result cache | false |
| ZO_INVERTED_INDEX_RESULT_CACHE_MAX_ENTRIES | Maximum number of cache entries | 10000 |
| ZO_INVERTED_INDEX_RESULT_CACHE_MAX_ENTRY_SIZE | Maximum size per cache entry in bytes | 20480 (20 KB) |
For a detailed explanation of how the Tantivy result cache works, memory requirements, and performance impact, refer to the Tantivy result cache section below.
Query behavior
Tantivy optimizes queries differently based on whether the field is full-text or secondary, and whether the query operates on a single stream or multiple streams. Using the right operator for each field type ensures the query is served from the index instead of scanning logs.
Note
Tantivy indexing supports logs, metrics, traces, and metadata streams.
Single-stream queries
A single-stream query retrieves data from one stream without using JOIN operations or subqueries that involve multiple streams.
Full-text index scenarios
Correct usage:
- Use match_all() for full-text index fields such as body or message, for example match_all('error').
- Use NOT with match_all(), for example NOT match_all('error').
Secondary index scenarios
Correct usage:
- Use = or IN (...) for secondary index fields such as kubernetes_namespace_name, kubernetes_pod_name, or kubernetes_container_name.
- Use NOT with = or IN (...).
Inefficient usage:
Mixed scenarios
When a query combines full-text and secondary fields, apply the best operator for each part.
Correct usage:
- match_all('error') uses the full-text index.
- kubernetes_namespace_name = 'ingress-nginx' uses the secondary index.
Incorrect usage:
AND and OR operator behavior
AND behavior
- If both sides are indexable, Tantivy intersects the row sets from each index.
- If one side is not indexable, the indexable side is still accelerated by Tantivy, and the other side is resolved in DataFusion.
Examples
OR behavior
- If all branches of the OR are indexable, Tantivy unites the row sets efficiently.
- If any branch is not indexable, the entire OR is not indexable. The query runs in DataFusion.
Examples
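The AND and OR rules can be modeled with plain Python sets. In this sketch, each filter is represented by the set of row IDs its index lookup returned, or by None when the filter is not indexable. The function names and the use of None are assumptions made for illustration; they do not mirror OpenObserve internals.

```python
from typing import Optional, Set

RowSet = Optional[Set[int]]  # None means "this filter is not indexable"


def and_from_index(left: RowSet, right: RowSet) -> RowSet:
    """AND: intersect when both sides are indexable; otherwise keep the
    indexable side as the candidate set. The non-indexable side is then
    evaluated later by the query engine on those candidates."""
    if left is not None and right is not None:
        return left & right
    return left if left is not None else right


def or_from_index(*branches: RowSet) -> RowSet:
    """OR: union only when every branch is indexable; a single
    non-indexable branch makes the whole OR non-indexable."""
    if any(branch is None for branch in branches):
        return None
    return set().union(*branches)


print(and_from_index({1, 2, 3}, {2, 3, 4}))  # {2, 3}
print(and_from_index({1, 2}, None))          # {1, 2}
print(or_from_index({1}, {2, 3}))            # {1, 2, 3}
print(or_from_index({1}, None))              # None
```

A result of None stands for "no help from the index", which corresponds to the case where the condition is resolved entirely in DataFusion.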
NOT with grouped conditions
Multi-stream queries
A multi-stream query combines data from two or more streams using JOIN operations or subqueries that convert to JOINs internally. OpenObserve applies Tantivy indexing to both sides of a JOIN to accelerate data retrieval.
What are multi-stream queries?
When a subquery converts to a JOIN, OpenObserve combines data from two sources. In a JOIN operation:
- The left table is the first table in the JOIN operation. It is the base table that the query starts with.
- The right table is the second table in the JOIN operation. It provides additional data that is matched against the left table based on a join condition.
The query engine reads rows from the left table, then for each row, it looks up matching rows in the right table using the join condition.
Example:
- t1 is the left table. It is the base table.
- t2 is the right table. It is the table being matched.
- The join condition t1.id = t2.id determines which rows from both tables are combined.
When a query includes a subquery in a WHERE clause with an IN operator, OpenObserve converts it to a JOIN operation. For example:
- The left table is the outer query, selecting kubernetes_namespace_name from default.
- The right table is the subquery, selecting distinct kubernetes_namespace_name values where kubernetes_container_name is ziox.
Tantivy can use indexes on both the left table and the right table to accelerate the query.
How indexing works in multi-stream queries
When OpenObserve executes a multi-stream query:
- The query optimizer identifies indexable conditions on both the left table and the right table of the JOIN.
- Tantivy retrieves row identifiers from the index for each table independently.
- The query engine combines the results based on the JOIN condition.
- If both tables use indexes, the query avoids scanning unrelated records entirely.
For example,
In this query, the subquery uses the secondary index on kubernetes_container_name to find matching namespaces, while the outer query uses the secondary index on kubernetes_pod_name. Both sides benefit from Tantivy indexing, eliminating the need for full table scans.
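The flow described above can be sketched in Python with toy data. The rows, pod names, and helper functions below are hypothetical and exist only to show the order of operations: one index lookup per side, followed by a join on kubernetes_namespace_name. It is not a description of how OpenObserve executes the query internally.

```python
from collections import defaultdict

# Hypothetical rows in the "default" stream, keyed by row ID.
ROWS = {
    1: {"kubernetes_namespace_name": "ziox",
        "kubernetes_container_name": "ziox",
        "kubernetes_pod_name": "ziox-querier-0"},
    2: {"kubernetes_namespace_name": "ziox",
        "kubernetes_container_name": "sidecar",
        "kubernetes_pod_name": "ziox-ingester-0"},
    3: {"kubernetes_namespace_name": "ingress-nginx",
        "kubernetes_container_name": "controller",
        "kubernetes_pod_name": "ziox-ingester-0"},
    4: {"kubernetes_namespace_name": "cert-manager",
        "kubernetes_container_name": "cert-manager",
        "kubernetes_pod_name": "cert-manager-0"},
}


def build_secondary_index(rows, field):
    """Map each exact field value to the set of row IDs that carry it."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        index[row[field]].add(row_id)
    return index


container_index = build_secondary_index(ROWS, "kubernetes_container_name")
pod_index = build_secondary_index(ROWS, "kubernetes_pod_name")

# Right side (subquery): index lookup, then the distinct namespaces.
right_ids = container_index.get("ziox", set())
namespaces = {ROWS[i]["kubernetes_namespace_name"] for i in right_ids}

# Left side (outer query): index lookup on the pod name.
left_ids = pod_index.get("ziox-ingester-0", set())

# Join: keep left rows whose namespace appears in the subquery result.
result = sorted(i for i in left_ids
                if ROWS[i]["kubernetes_namespace_name"] in namespaces)
print(namespaces)  # {'ziox'}
print(result)      # [2]
```

Only the rows returned by the two index lookups are ever inspected; row 4 is never read.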
match_all in multi-stream queries
The match_all() function is supported in multi-stream queries with specific limitations. OpenObserve checks whether the full-text index field exists in the stream before applying match_all().
Supported scenarios:
Use match_all() in subqueries that filter a single stream:
Use match_all() in both the outer query and a subquery with an IN condition:
Both the outer query and the subquery call match_all(), and both leverage the full-text index to retrieve matching row identifiers.
Unsupported scenarios:
Do not use match_all() outside a subquery when the subquery contains aggregation or grouping:
match_all('error') cannot determine which stream to search because the subquery has already aggregated the data.
Partitioned search with inverted index
OpenObserve searches individual partitions using the inverted index when executing multi-stream queries. This behavior ensures that queries distribute efficiently across partitions and leverage indexing at the partition level.
Index Optimizer
What is the Index Optimizer?
OpenObserve includes an index optimizer that accelerates specific query patterns by using Tantivy indexes more efficiently. The optimizer works automatically for both standalone queries and subqueries when certain conditions are met.
The optimizer handles four query patterns: count, histogram, top N, and distinct queries.
Optimized Query Patterns
Count Queries
The optimizer accelerates queries that count total records.
Example:
Requirements:
- All filters in the WHERE clause must be indexable by Tantivy
Histogram Queries
The optimizer accelerates queries that generate histogram data grouped by time intervals.
Example:
Requirements:
- All filters in the WHERE clause must be indexable by Tantivy
Top N Queries
The optimizer accelerates queries that retrieve the top N results based on count, ordered in descending order.
Example:
Requirements:
- All filters in the WHERE clause must be indexable by Tantivy
- The field being grouped must be a secondary index field
Distinct Queries
The optimizer accelerates queries that retrieve distinct values for a field.
Example:
Requirements:
- All filters in the WHERE clause must be indexable by Tantivy
- The field in the SELECT clause must be a secondary index field
- The WHERE clause must use str_match() on that same field
General Requirements
For all four query patterns, every filter condition in the WHERE clause must be indexable by Tantivy. Refer to the Single-stream queries and Multi-stream queries sections for details on which operators and conditions are indexable.
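The requirements listed for the four patterns can be summarized as a small decision function. This Python sketch only restates the rules from this section; the function name and parameters are hypothetical, and the real optimizer inspects the query plan rather than boolean flags.

```python
SUPPORTED_PATTERNS = {"count", "histogram", "top_n", "distinct"}


def optimizer_applies(pattern,
                      all_filters_indexable,
                      target_is_secondary_index=False,
                      uses_str_match_on_target=False):
    """Return True when the documented requirements for a pattern are met.

    target_is_secondary_index: the grouped field (top N) or the selected
    field (distinct) is a secondary index field.
    uses_str_match_on_target: the WHERE clause applies str_match() to
    that same field (distinct only).
    """
    if pattern not in SUPPORTED_PATTERNS:
        return False
    if not all_filters_indexable:  # general requirement for every pattern
        return False
    if pattern == "top_n":
        return target_is_secondary_index
    if pattern == "distinct":
        return target_is_secondary_index and uses_str_match_on_target
    return True  # count and histogram have no extra requirement


print(optimizer_applies("count", all_filters_indexable=True))     # True
print(optimizer_applies("top_n", all_filters_indexable=True))     # False
print(optimizer_applies("distinct", True, True, True))            # True
```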
Tantivy result cache
What is the Tantivy result cache?
The Tantivy result cache stores the output of Tantivy index searches to enhance search performance for repeated queries. When the cache is enabled and a query is executed, OpenObserve checks if identical results already exist in the cache. If found, the query retrieves results from the cache instead of re-executing index lookups, significantly reducing search time.
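To make the roles of the two size limits concrete, here is a minimal Python sketch of a bounded result cache. The class name, the least-recently-used eviction policy, and the choice to skip results larger than the per-entry limit are all assumptions made for this sketch; this document does not specify how OpenObserve evicts entries or handles oversized results.

```python
from collections import OrderedDict


class ResultCacheSketch:
    """Toy cache keyed by a query identifier, bounded by count and entry size."""

    def __init__(self, max_entries=10000, max_entry_size=20480):
        self.max_entries = max_entries
        self.max_entry_size = max_entry_size
        self._entries = OrderedDict()

    def get(self, key):
        """Return the cached payload, or None on a miss."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, payload: bytes):
        """Store a payload; return False if it exceeds the per-entry limit."""
        if len(payload) > self.max_entry_size:
            return False  # assumption: oversized results are not cached
        self._entries[key] = payload
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # assumption: evict the oldest
        return True


cache = ResultCacheSketch(max_entries=2, max_entry_size=4)
print(cache.put("q1", b"1234"))   # True
print(cache.put("q2", b"12345"))  # False, larger than max_entry_size
print(cache.get("q1"))            # b'1234'
```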
The cache is disabled by default. To enable it, configure the environment variables described in the Configure environment variable section.
Memory requirements for Tantivy result cache
The Tantivy result cache requires memory based on the number of entries and the size of each entry. Calculate the memory required using this formula:
Example calculation with default configuration:
Note
When adjusting ZO_INVERTED_INDEX_RESULT_CACHE_MAX_ENTRIES or ZO_INVERTED_INDEX_RESULT_CACHE_MAX_ENTRY_SIZE, use this formula to ensure sufficient memory is available.
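As a worked example, the following Python snippet estimates the upper bound for the default settings. It assumes the required memory is the maximum number of entries multiplied by the maximum entry size, and it ignores any per-entry bookkeeping overhead; the function name is hypothetical.

```python
def result_cache_memory_bytes(max_entries, max_entry_size_bytes):
    """Upper bound on cache memory: every slot filled with a maximum-size entry."""
    return max_entries * max_entry_size_bytes


# Defaults: ZO_INVERTED_INDEX_RESULT_CACHE_MAX_ENTRIES = 10000,
#           ZO_INVERTED_INDEX_RESULT_CACHE_MAX_ENTRY_SIZE = 20480 bytes.
default_bytes = result_cache_memory_bytes(10_000, 20_480)
print(default_bytes)                  # 204800000
print(default_bytes / (1024 * 1024))  # 195.3125
```

Under this assumption, the default configuration needs roughly 200 MB of memory when the cache is full.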
Performance impact
When the cache is enabled and a query result is found in the cache, search time can be reduced from hundreds of milliseconds to a few milliseconds. The cache is most effective for workloads with repeated queries using identical filters.
Verify if a query is using Tantivy
To confirm whether a query used the Tantivy inverted index:
- Open the browser developer tools and go to the Network tab.
- Inspect the query response JSON.
- Under took_detail, check the value of idx_took:
  - If idx_took is greater than 0, the query used the inverted index.
  - If idx_took is 0, the query did not use the inverted index.
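The same check can be scripted against a saved response. The Python sketch below assumes that took_detail is a top-level object in the response JSON and that idx_took is a number; the sample payloads are made up for illustration and omit every other field of a real response.

```python
import json


def used_inverted_index(response):
    """Return True when took_detail.idx_took is greater than 0."""
    took_detail = response.get("took_detail") or {}
    return took_detail.get("idx_took", 0) > 0


# Hypothetical, heavily trimmed response bodies.
indexed = json.loads('{"took_detail": {"idx_took": 7}}')
not_indexed = json.loads('{"took_detail": {"idx_took": 0}}')

print(used_inverted_index(indexed))      # True
print(used_inverted_index(not_indexed))  # False
```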