Enrichment
Introduction
Enrichment refers to the process of adding more context to the data. This can be done by adding new fields to the data or by modifying existing fields.
You can create a VRL function that performs enrichment either at ingestion time or at query time.
Some examples of enrichment, each of which adds a new field to the data:
- You have a country code field and you want the full country name.
- You have a status code (1, 2, 3) and you want a new field that says whether the status is success, failure, or unknown.
- You have an IP address and you want a new field that says whether the IP address is internal or external.
- You have an IP address and you want the geolocation of the IP address.
- You have a protocol number and you want the protocol name.
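To make one of these cases concrete, the internal-versus-external classification can be sketched in Python. This only illustrates the logic and is not how OpenObserve implements enrichment. The function name classify_ip and the output field names are hypothetical, and "internal" is assumed to mean whatever Python's ipaddress module reports as private address space.

```python
import ipaddress


def classify_ip(address: str) -> str:
    """Label an IP address as "internal" or "external".

    Assumption: "internal" means a private address as reported by
    Python's ipaddress module (for example 10.0.0.0/8 or loopback).
    """
    ip = ipaddress.ip_address(address)
    return "internal" if ip.is_private else "external"


# Hypothetical event with a source and a destination address.
event = {"srcaddr": "10.3.150.41", "dstaddr": "173.72.40.32"}

# Enrichment adds new fields next to the existing ones.
event["src_scope"] = classify_ip(event["srcaddr"])
event["dst_scope"] = classify_ip(event["dstaddr"])
print(event)
```

If your definition of internal is narrower, for instance only your own VPC ranges, you would compare against those ranges instead.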
To do enrichment, you need to create an enrichment table. An enrichment table is a CSV file that holds the reference data you want to use to enrich the actual data.
Example
Source data (Log stream)
For example, suppose you have AWS VPC flow logs. You can find more details about VPC flow logs in the AWS documentation.
The data might look like this (many fields have been removed to keep the example simple):
[
{ "_timestamp": 1685264705559653, "dstaddr": "10.3.150.41", "packets": 5, "protocol": 6, "srcaddr": "10.3.76.90" },
{ "_timestamp": 1685264705559618, "dstaddr": "173.72.40.32", "packets": 1, "protocol": 17, "srcaddr": "10.3.150.41" },
{ "_timestamp": 1685264705559581, "dstaddr": "10.3.150.41", "packets": 1, "protocol": 17, "srcaddr": "173.72.40.32" },
{ "_timestamp": 1685264705559551, "dstaddr": "10.3.57.95", "packets": 5, "protocol": 6, "srcaddr": "10.3.150.41" }
]
Notice the protocol field. The number by itself does not tell you which protocol it is. You need to look up the protocol number in an enrichment table to get the protocol name.
Enrichment table
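The enrichment table is a CSV file that maps each protocol number (the protocol_number column) to a protocol name (the keyword column), for example 6 to TCP and 17 to UDP.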
Desired output
Our goal is to get the logs to look like this:
[
{ "_timestamp": 1685264705559653, "dstaddr": "10.3.150.41", "packets": 5, "protocol": "TCP", "srcaddr": "10.3.76.90" },
{ "_timestamp": 1685264705559618, "dstaddr": "173.72.40.32", "packets": 1, "protocol": "UDP", "srcaddr": "10.3.150.41" },
{ "_timestamp": 1685264705559581, "dstaddr": "10.3.150.41", "packets": 1, "protocol": "UDP", "srcaddr": "173.72.40.32" },
{ "_timestamp": 1685264705559551, "dstaddr": "10.3.57.95", "packets": 5, "protocol": "TCP", "srcaddr": "10.3.150.41" }
]
6 and 17 have been replaced with TCP and UDP respectively.
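The transformation from source data to desired output can be sketched locally in Python. This is a minimal illustration, not the OpenObserve mechanism: the two-row CSV is a hypothetical excerpt of the enrichment table, and the column names protocol_number and keyword are assumed from the lookup used later on this page.

```python
import csv
import io

# Hypothetical excerpt of the enrichment table (assumed column names).
PROTOCOLS_CSV = """protocol_number,keyword
6,TCP
17,UDP
"""

# Index the reference rows by protocol number. CSV values are strings.
lookup = {
    row["protocol_number"]: row["keyword"]
    for row in csv.DictReader(io.StringIO(PROTOCOLS_CSV))
}

# Two of the log records shown above, with _timestamp omitted for brevity.
logs = [
    {"dstaddr": "10.3.150.41", "packets": 5, "protocol": 6, "srcaddr": "10.3.76.90"},
    {"dstaddr": "173.72.40.32", "packets": 1, "protocol": 17, "srcaddr": "10.3.150.41"},
]

# Replace each protocol number with its name; unknown numbers stay as-is.
enriched = [
    {**record, "protocol": lookup.get(str(record["protocol"]), record["protocol"])}
    for record in logs
]

for record in enriched:
    print(record)
```

The lookup key is converted with str() because the log data stores the protocol as a number while the CSV stores it as text.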
Hands-on exercise
Let's do a hands-on exercise to understand how enrichment works.
Upload sample data
Download sample data for VPC flow log and ingest it into your OpenObserve instance.
curl -L https://github.com/openobserve/openobserve/releases/download/v0.4.4/vpc_flow_log.json.gz -o vpc_flow_log.json.gz
curl -L https://github.com/openobserve/openobserve/releases/download/v0.4.4/protocols.csv -o protocols.csv
gunzip vpc_flow_log.json.gz
The above commands will download the sample data and unzip it. It will also download the enrichment table. Now, let's ingest the data into OpenObserve.
curl -u user@domain.com:abqlg4b673465w46hR2905 -k https://api.openobserve.ai/api/User_organization_435345/vpc_flow_log/_json -d "@vpc_flow_log.json"
For a self-hosted instance, you can use the following command:
curl http://localhost:5080/api/default/vpc_flow_log/_json -i -u "root@example.com:Complexpass#123" -d "@vpc_flow_log.json"
Output
HTTP/2 200
date: Sun, 28 May 2023 11:56:42 GMT
content-type: application/json
content-length: 76
vary: accept-encoding
strict-transport-security: max-age=15724800; includeSubDomains
access-control-allow-origin: *
access-control-allow-credentials: true
access-control-allow-methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
access-control-allow-headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
access-control-max-age: 1728000
{"code":200,"status":[{"name":"vpc_flow_log","successful":9071,"failed":0}]}
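If you script the ingestion, it can help to check the response body rather than only the HTTP status. The sketch below parses the JSON body shown in the output above; the helper name summarize_ingest is hypothetical, and the response shape is assumed to match that single example.

```python
import json

# Response body copied from the output above.
body = '{"code":200,"status":[{"name":"vpc_flow_log","successful":9071,"failed":0}]}'


def summarize_ingest(response_body: str) -> dict:
    """Return {stream_name: (successful, failed)} from an ingest response.

    Assumption: the body has a "status" list whose entries carry
    "name", "successful" and "failed", as in the example output.
    """
    payload = json.loads(response_body)
    return {
        entry["name"]: (entry["successful"], entry["failed"])
        for entry in payload["status"]
    }


summary = summarize_ingest(body)
for stream, (ok, failed) in summary.items():
    print(f"{stream}: {ok} ingested, {failed} failed")
```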
Upload enrichment table
Now let's set up our enrichment table. Go to the OpenObserve UI and click on Functions > enrichment tables. Click on the Add enrichment table button. Give it the name protocols, upload the CSV file, and click on Save. You should see the following screen:
You can view the table's contents on the logs page.
Enrich the log stream
Now that you have the data and the enrichment table set up, let's head over to the logs page.
Add the VRL function below to the VRL function box and click on the Run query button.
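VRL function
protocol, err = get_enrichment_table_record("protocols", {"protocol_number": to_string!(.protocol)})
.protocol_keyword = protocol.keyword
.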
You should see an additional field protocol_keyword in the field list. This field is added by the VRL function. The VRL function does a lookup in the enrichment table and adds the protocol_keyword field to the output.
Let's break down each line in the VRL function:
- protocol, err = get_enrichment_table_record("protocols", {"protocol_number": to_string!(.protocol)}): This retrieves a record from the enrichment table named protocols. The record to retrieve is determined by the protocol_number key, which is set to the string conversion of the .protocol field from the event record currently being processed. The conversion is needed because .protocol is of type int64 in the log data whereas protocol_number in the enrichment table is of type string. The function get_enrichment_table_record returns two values: the record (if found) and an error object (if there was an issue). The record is stored in protocol and the error is stored in err.
- .protocol_keyword = protocol.keyword: This line adds or updates a field named protocol_keyword in the current event record, setting its value to the keyword field from the protocol record retrieved in the previous step.
- .: This line indicates that the current event record should be returned as the output of the VRL function.
So, the overall purpose of this script is to enrich event data by adding a protocol_keyword field, which is retrieved from an enrichment table named "protocols". The table lookup key is derived from the protocol field of the event data. If the protocol field is not present, or if the lookup fails, an error will be generated.
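The lookup pattern described above, a call that yields a record and an error, and a key that must be converted to a string first, can be imitated in a few lines of Python. This is a sketch under stated assumptions and not the VRL implementation: get_record is a hypothetical stand-in for get_enrichment_table_record, and the in-memory table is a hypothetical two-row copy of the protocols CSV.

```python
from typing import Optional, Tuple

# Hypothetical in-memory copy of the protocols table. Values are strings
# because they come from a CSV file.
PROTOCOLS = [
    {"protocol_number": "6", "keyword": "TCP"},
    {"protocol_number": "17", "keyword": "UDP"},
]


def get_record(table: list, condition: dict) -> Tuple[Optional[dict], Optional[str]]:
    """Return (record, None) for the first row matching every key in
    condition, or (None, message) when no row matches.

    This only imitates the (record, error) shape described in the text.
    """
    for row in table:
        if all(row.get(key) == value for key, value in condition.items()):
            return row, None
    return None, "record not found"


event = {"protocol": 6, "packets": 5}

# Looking up the raw integer finds nothing in this sketch: 6 != "6".
_, miss_err = get_record(PROTOCOLS, {"protocol_number": event["protocol"]})

# Converting to a string first, as to_string! does in the VRL, finds the row.
protocol, err = get_record(PROTOCOLS, {"protocol_number": str(event["protocol"])})
if err is None:
    event["protocol_keyword"] = protocol["keyword"]

print(miss_err)
print(event)
```

Checking err before reading from the record keeps a failed lookup from producing a misleading value in the enriched event.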
GeoIP enrichment using MaxMind GeoIP lite database
OpenObserve supports GeoIP enrichment (IP address to location) using the MaxMind GeoIP lite database. OpenObserve downloads the GeoIP database from MaxMind and stores it locally. This way the MaxMind tables are available to you as part of the OpenObserve installation and you do not need to upload any enrichment table. You can use the VRL function get_enrichment_table_record to enrich your data with the GeoIP data.
For example:
.geo_city = get_enrichment_table_record!("maxmind_city", {"ip": .ip })
.geo_asn = get_enrichment_table_record!("maxmind_asn", {"ip": .ip })
.
Example of a record returned from the maxmind_city table:
{
"city_name": "Bengaluru",
"continent_code": "AS",
"country_code": "IN",
"country_name": "India",
"latitude": "12.9634",
"longitude": "77.5855",
"postal_code": "560002",
"region_code": "KA",
"region_name": "Karnataka",
"timezone": "Asia/Kolkata"
}
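Note that latitude and longitude are strings in the example record above. If downstream tooling expects numbers, the values need converting. The Python sketch below shows one way to pick a few fields and convert the coordinates; the helper name flatten_geo and the output field names are hypothetical, and the record shape is assumed to match the example.

```python
# Record copied from the maxmind_city example above.
geo_city = {
    "city_name": "Bengaluru",
    "continent_code": "AS",
    "country_code": "IN",
    "country_name": "India",
    "latitude": "12.9634",
    "longitude": "77.5855",
    "postal_code": "560002",
    "region_code": "KA",
    "region_name": "Karnataka",
    "timezone": "Asia/Kolkata",
}


def flatten_geo(record: dict, prefix: str = "geo_") -> dict:
    """Pick a few location fields and convert coordinates to floats."""
    return {
        f"{prefix}city": record["city_name"],
        f"{prefix}country": record["country_name"],
        f"{prefix}lat": float(record["latitude"]),
        f"{prefix}lon": float(record["longitude"]),
    }


flat = flatten_geo(geo_city)
print(flat)
```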
Example of a record returned from the maxmind_asn table: