Tutorial 99. Pushing Data to OpenSearch#
Danger
The tutorial is not yet ready or fully working. Some parsing errors remain that could not be resolved via the command line.
One scenario we aim to support is organisations that want to augment their internal search with web data. Internal search systems are often based on Elastic or OpenSearch.
Consequently, this tutorial shows how owilix can be used to
query data
convert it to JSON
use owilix to push the JSON to OpenSearch
write a custom converter for the JSON and then push the result to OpenSearch
Please note that the code here is for educational purposes only. The provided scripts will most likely not scale in a production environment and are neither fail-safe nor free of problems.
OWILIX OpenSearch Plugin#
Owilix supports a lightweight plugin system that can register additional commands. One example is the OpenSearch plugin. It expects either a JSON file or a JSON stream on stdin and pushes the JSON to the configured OpenSearch cluster.
Testing the OpenSearch plugin from a file#
The plugin expects the credentials and the OpenSearch endpoint to be configured via environment variables (defined here for a Linux shell):
export OWILIX_OPENSEARCH_USERNAME=aadfs
export OWILIX_OPENSEARCH_PASSWORD='alksdjflkajsd;fljklsdjf;'
export OWILIX_OPENSEARCH_HOST=https://localhost:9200
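For scripts that run alongside the plugin, the same three variables can be read from the environment in Python. The sketch below is only an illustration: the helper name `load_opensearch_config` is hypothetical and not part of owilix, and how the plugin itself reads these variables is not shown in this tutorial.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names taken from the export statements above.
REQUIRED_VARS = (
    "OWILIX_OPENSEARCH_USERNAME",
    "OWILIX_OPENSEARCH_PASSWORD",
    "OWILIX_OPENSEARCH_HOST",
)


def load_opensearch_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: collect the settings and fail early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_opensearch_config()` without arguments reads from `os.environ`; passing a dict makes the check easy to try out without touching the shell.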
Now let's create a JSONL file (let's assume it is called data.json):
{"name": "John Doe", "age": 29, "city": "London", "warc_date": "2021-01-01T12:00:00Z"}
{"name": "Mary Jane", "age": 21, "city": "Berlin", "warc_date": "2021-01-01T12:00:00Z"}
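Hand-written JSON lines are easy to get wrong, so it can help to generate the file with Python's `json` module instead. The following minimal sketch writes the two sample records to data.json and checks that every line parses on its own; the helper names `write_jsonl` and `read_jsonl` are made up for this example.

```python
import json
from pathlib import Path
from typing import Dict, List

records = [
    {"name": "John Doe", "age": 29, "city": "London", "warc_date": "2021-01-01T12:00:00Z"},
    {"name": "Mary Jane", "age": 21, "city": "Berlin", "warc_date": "2021-01-01T12:00:00Z"},
]


def write_jsonl(path: Path, rows: List[Dict]) -> None:
    """Write one JSON object per line (JSON Lines)."""
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> List[Dict]:
    """Parse every non-empty line; json.loads raises on a malformed line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


write_jsonl(Path("data.json"), records)
assert read_jsonl(Path("data.json")) == records
```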
Then push it to an OpenSearch instance:
owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test file_path=data.json
We can also pipe the file to the plugin on stdin:
cat data.json | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test
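To make the stdin variant more concrete, the sketch below shows how a consumer could read JSON lines from a stream and turn them into the newline-delimited body that the OpenSearch `_bulk` endpoint accepts. This is not the plugin's implementation, which is not shown in this tutorial. The function names are hypothetical, and sending the body over HTTP is left out.

```python
import io
import json
from typing import Dict, Iterable, Iterator, TextIO


def iter_json_lines(stream: TextIO) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build an NDJSON bulk body: an action line followed by the document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Stand-in for sys.stdin so the example runs without a pipe.
sample = io.StringIO('{"name": "John Doe"}\n{"name": "Mary Jane"}\n')
body = to_bulk_body("owilix_test", iter_json_lines(sample))
```

In a real pipe, `sample` would be replaced by `sys.stdin`. Streaming line by line keeps memory use flat even for large inputs.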
Using converters, settings and mappings#
OWI input data can be rather complex, particularly when it has to be converted to a specific index schema. owilix therefore supports creating a callable converter class and plugging it in.
For example, a simple datetime converter class is provided within the owilix package.
from datetime import datetime
from typing import List, Dict

from dateutil import parser


class DateConverter:
    def __init__(self, **kwargs):
        pass

    def __call__(self, jsons: List[Dict]) -> List[Dict]:
        """Expects a list of dicts and returns a list of dicts that must be JSON serializable (i.e. no complex data types)."""
        for json in jsons:
            # Normalize an existing warc_date string to the pattern YYYY-MM-DDTHH:MM:SSZ
            if "warc_date" in json and isinstance(json["warc_date"], str):
                json["warc_date"] = parser.parse(json["warc_date"]).strftime('%Y-%m-%dT%H:%M:%SZ')
            # Build a date field from separate year, month and day values
            if "day" in json and "month" in json and "year" in json:
                json["date"] = datetime(json["year"], json["month"], json["day"]).strftime('%Y-%m-%dT%H:%M:%SZ')
        return jsons
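The converter can be used by specifying the converter class in the plugin call via the convert_fn parameter. In the call below, your.module.YourConverter is a placeholder for the fully qualified class name of the converter:
cat data.json | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test convert_fn=your.module.YourConverter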
Note that the converter can also split documents, e.g. if you want to create chunks of documents or split up microdata.
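As an illustration of splitting, here is a minimal sketch of a converter that breaks a long text field into fixed-size chunks and emits one document per chunk. The field name `text`, the `chunk_size` argument and the added `chunk_id` field are assumptions made for this example, not part of owilix.

```python
from typing import Dict, List


class ChunkingConverter:
    """Hypothetical converter: one input document becomes one document per chunk."""

    def __init__(self, chunk_size: int = 1000, field: str = "text", **kwargs):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self.field = field

    def __call__(self, jsons: List[Dict]) -> List[Dict]:
        out: List[Dict] = []
        for doc in jsons:
            text = doc.get(self.field)
            if not isinstance(text, str) or len(text) <= self.chunk_size:
                out.append(doc)  # nothing to split, pass through unchanged
                continue
            for i, start in enumerate(range(0, len(text), self.chunk_size)):
                chunk = dict(doc)  # shallow copy keeps the other fields
                chunk[self.field] = text[start:start + self.chunk_size]
                chunk["chunk_id"] = i
                out.append(chunk)
        return out


docs = ChunkingConverter(chunk_size=4)([{"id": "a", "text": "abcdefghij"}])
# docs now holds three documents whose text is "abcd", "efgh" and "ij"
```

Because the output list may be longer than the input list, any identifier field copied into the chunks is no longer unique on its own. Combining it with `chunk_id` is one way to keep documents distinguishable.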
You can also provide additional settings and mappings to the owilix OpenSearch command, but this functionality has not been thoroughly tested.
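For orientation, an index body with settings and mappings for the sample records could look like the sketch below. The field types and shard settings are assumptions based on the example data, and how such a body is handed to the owilix command is not covered here.

```python
import json

# Assumed index body for the sample records; adjust types to your own data.
index_body = {
    "settings": {"index": {"number_of_shards": 1, "number_of_replicas": 0}},
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "age": {"type": "integer"},
            "city": {"type": "keyword"},
            # Matches timestamps such as 2021-01-01T12:00:00Z
            "warc_date": {"type": "date", "format": "strict_date_time_no_millis"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Declaring `warc_date` as a date field is what makes the normalized timestamp format produced by a date converter useful for range queries.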
Writing your own Converter#
You only need to implement a callable class whose __call__ method receives a list of dicts and returns a list of dicts. Then provide the fully qualified class name in the convert_fn parameter and you are good to go.
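A minimal sketch of such a class follows. `KeepFieldsConverter` and its `fields` argument are made up for this example. The `load_converter` helper only illustrates how a fully qualified class name can be resolved with `importlib`; it is an assumption about the general mechanism, not owilix's actual loading code.

```python
import importlib
from typing import Dict, List, Sequence


class KeepFieldsConverter:
    """Hypothetical converter that keeps only selected keys of each document."""

    def __init__(self, fields: Sequence[str] = ("url", "id", "title"), **kwargs):
        self.fields = tuple(fields)

    def __call__(self, jsons: List[Dict]) -> List[Dict]:
        return [{k: doc[k] for k in self.fields if k in doc} for doc in jsons]


def load_converter(qualified_name: str, **kwargs):
    """Resolve 'package.module.ClassName' to a class and instantiate it."""
    module_name, _, class_name = qualified_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)


converter = KeepFieldsConverter(fields=["url", "title"])
print(converter([{"url": "https://example.org", "title": "Example", "lang": "en"}]))
```

Accepting `**kwargs` in `__init__`, as the DateConverter does, keeps the class tolerant of extra arguments it does not use.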
Indexing Dataset Shards#
Indexing dataset shards combines the owilix plugin with the owilix query capability. We basically run a less on the data (i.e. listing datasets) with a specific query. The output is then piped into the owilix OpenSearch plugin.
owilix query less --remote it4i:latest "select=url,id,title" as_json=True | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test
The above command pipes the url, id and title of each record into the OpenSearch instance defined via the environment variables.
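The same pipeline can be driven from Python, for example inside a scheduled job. The sketch below chains the two commands above with `subprocess`. The function names are hypothetical, and the sketch assumes that owilix is on the PATH and that the environment variables are set.

```python
import subprocess
from typing import List, Tuple

PLUGIN = "owilix.plugins.push.opensearch.OpenSearchIndexer"


def build_commands(remote: str, select: str, index: str) -> Tuple[List[str], List[str]]:
    """Return the argument lists for the query and the plugin call."""
    query = ["owilix", "query", "less", "--remote", remote,
             f"select={select}", "as_json=True"]
    push = ["owilix", "plugin", PLUGIN, "index", index]
    return query, push


def run_pipeline(remote: str, select: str, index: str) -> int:
    """Pipe the query output into the plugin, like `query | plugin` in a shell."""
    query_cmd, push_cmd = build_commands(remote, select, index)
    query = subprocess.Popen(query_cmd, stdout=subprocess.PIPE)
    push = subprocess.Popen(push_cmd, stdin=query.stdout)
    query.stdout.close()  # let the query receive SIGPIPE if the plugin exits early
    push_rc = push.wait()
    query_rc = query.wait()
    return push_rc or query_rc  # non-zero if either side failed


# Example call (requires owilix and a reachable cluster):
# run_pipeline("it4i:latest", "url,id,title", "owilix_test")
```

Passing argument lists instead of a shell string avoids quoting problems with values such as the select expression.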