Tutorial 99. Pushing Data to OpenSearch

Danger

This tutorial is not yet complete or fully working. Some parsing errors remain that could not be resolved via the command line.

One scenario we aim to support is organisations that want to augment their internal search with web data. Internal search systems are often based on Elasticsearch or OpenSearch.

Consequently, this tutorial shows how owilix can be used to

  1. query data

  2. convert it to JSON

  3. push the JSON to OpenSearch

  4. write a custom converter for the JSON and then push the result to OpenSearch

Please note that the code here is for educational purposes only. The provided scripts will most likely not scale in a production environment and are not fail-safe.

OWILIX OpenSearch Plugin

owilix supports a lightweight plugin system that can register additional commands. One example is the OpenSearch plugin, which expects either a JSON file or a JSON stream on stdin and pushes the JSON to the configured OpenSearch cluster.

Testing the OpenSearch plugin from a file

The plugin expects the credentials and the OpenSearch endpoint to be configured via environment variables (defined here for a Linux shell):

export OWILIX_OPENSEARCH_USERNAME=aadfs
export OWILIX_OPENSEARCH_PASSWORD='alksdjflkajsd;fljklsdjf;'
export OWILIX_OPENSEARCH_HOST=https://localhost:9200
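Before calling the plugin it can help to confirm that all three variables are actually set in the current environment. The following is a minimal sketch of such a check; the helper name missing_opensearch_vars is hypothetical and not part of owilix, only the variable names are taken from the exports above.

```python
import os

# Variable names taken from the export statements above.
REQUIRED_VARS = (
    "OWILIX_OPENSEARCH_USERNAME",
    "OWILIX_OPENSEARCH_PASSWORD",
    "OWILIX_OPENSEARCH_HOST",
)


def missing_opensearch_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; it is not part of owilix.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report which variables are still missing in the current shell environment.
missing = missing_opensearch_vars()
print("missing:", ", ".join(missing) if missing else "none")
```

Passing a dict instead of relying on os.environ makes the check easy to try out without touching the real environment.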

Now let's create a JSON Lines file (let's assume it is called data.json):

{"name": "John Doe", "age": 29, "city": "London", "warc_date": "2021-01-01T12:00:00Z"}
{"name": "Mary Jane", "age": 21, "city": "Berlin", "warc_date": "2021-01-01T12:00:00Z"}

And push it to an OpenSearch instance:
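Since the plugin consumes one JSON object per line, the test file can also be generated programmatically, which guarantees that every line is valid JSON. The sketch below uses only the Python standard library; the helper names write_jsonl and read_jsonl are hypothetical, and a temporary directory is used so that no existing data.json is overwritten.

```python
import json
import os
import tempfile

records = [
    {"name": "John Doe", "age": 29, "city": "London", "warc_date": "2021-01-01T12:00:00Z"},
    {"name": "Mary Jane", "age": 21, "city": "Berlin", "warc_date": "2021-01-01T12:00:00Z"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Parse the file back; json.loads raises ValueError on a malformed line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# A temporary directory keeps the sketch from overwriting a real data.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    write_jsonl(path, records)
    round_trip = read_jsonl(path)

print(len(round_trip), "records written and parsed back")
```

Reading the file back is a cheap way to catch malformed lines before they reach the indexer.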

owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test file_path=data.json

We can also pipe the file to the plugin via stdin:

cat data.json | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test

Using converters, settings and mappings

OWI input data can be rather complex, particularly when converting to specific index schemas. owilix therefore supports creating a callable converter class and plugging it in.

For example, a simple datetime converter class is provided within the owilix package:

from datetime import datetime
from typing import Dict, List

from dateutil import parser


class DateConverter:
    def __init__(self, **kwargs):
        pass

    def __call__(self, jsons: List[Dict]) -> List[Dict]:
        """Expects a list of dicts and returns a list of dicts that must be
        JSON serializable (i.e. no complex data types)."""
        for doc in jsons:
            # Normalize an existing warc_date string. The trailing Z is
            # appended literally; no timezone conversion is performed.
            if "warc_date" in doc and isinstance(doc["warc_date"], str):
                doc["warc_date"] = parser.parse(doc["warc_date"]).strftime("%Y-%m-%dT%H:%M:%SZ")
            # Build a date field from separate year, month and day fields.
            if "day" in doc and "month" in doc and "year" in doc:
                doc["date"] = datetime(doc["year"], doc["month"], doc["day"]).strftime("%Y-%m-%dT%H:%M:%SZ")
        return jsons

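The converter is used by passing its fully qualified class name in the convert_fn parameter of the plugin call. In the command below, MODULE.PATH is a placeholder for the module that contains the class:

cat data.json | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test convert_fn=MODULE.PATH.DateConverter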

Note that the converter can also split documents, e.g. if you want to create chunks of documents or split up microdata.

You can also provide additional settings and mappings to the owilix OpenSearch command, but this functionality has not been thoroughly tested.
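To illustrate what such settings and mappings might contain, here is a sketch of a standard OpenSearch index body for the sample records, built as a Python dict. The field types and shard counts are assumptions chosen for the example data, and how the plugin accepts settings and mappings is not specified in this tutorial, so the sketch only prints the JSON.

```python
import json

# Assumption: warc_date and date hold ISO 8601 strings such as
# "2021-01-01T12:00:00Z", which OpenSearch's default date format accepts.
index_body = {
    "settings": {
        "index": {"number_of_shards": 1, "number_of_replicas": 0},
    },
    "mappings": {
        "properties": {
            "warc_date": {"type": "date"},
            "date": {"type": "date"},
            "name": {"type": "text"},
            "city": {"type": "keyword"},
            "age": {"type": "integer"},
        }
    },
}

# Serialize the body so it can be stored in a file or inspected.
print(json.dumps(index_body, indent=2))
```

Mapping the date fields explicitly keeps them from being indexed as plain strings, which matters for range queries on warc_date.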

Writing your own Converter

You only need to implement a callable class whose __call__ method receives a list of dicts and returns a list of dicts. Then provide the fully qualified class name in the convert_fn parameter and you are good to go.
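As an illustration of this contract, the following sketch shows a converter that splits long documents into chunks. The class name ChunkConverter, the text field, and the chunk_index field are hypothetical and not part of owilix; the constructor accepts **kwargs in the same way as DateConverter, but how owilix passes arguments to a converter is not covered here.

```python
from typing import Dict, List


class ChunkConverter:
    """Hypothetical converter that splits a text field into fixed-size chunks."""

    def __init__(self, field: str = "text", size: int = 200, **kwargs):
        # Extra keyword arguments are accepted and ignored, as in DateConverter.
        self.field = field
        self.size = size

    def __call__(self, jsons: List[Dict]) -> List[Dict]:
        out: List[Dict] = []
        for doc in jsons:
            text = doc.get(self.field)
            if not isinstance(text, str) or len(text) <= self.size:
                out.append(doc)  # nothing to split, pass through unchanged
                continue
            for start in range(0, len(text), self.size):
                chunk = dict(doc)  # shallow copy keeps the other fields
                chunk[self.field] = text[start:start + self.size]
                chunk["chunk_index"] = start // self.size
                out.append(chunk)
        return out


# Small demonstration with a deliberately tiny chunk size.
converter = ChunkConverter(field="text", size=5)
docs = converter([{"id": "a", "text": "hello world"}, {"id": "b", "title": "no text"}])
print(len(docs), "documents after chunking")
```

All chunks of one document keep the original id in this sketch. If the index uses that field as the document identifier, a per-chunk identifier would probably be needed as well.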

Indexing Dataset Shards

Indexing dataset shards combines the owilix plugin with the owilix query capability. We basically run a less on the data (i.e. listing datasets) with a specific query and pipe the output into the owilix OpenSearch plugin.

owilix query less --remote it4i:latest "select=url,id,title" as_json=True | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test

The above command pipes the url, id and title fields into the OpenSearch instance defined via the environment variables.
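Because parsing errors are a known problem in this pipeline, one option is to place a small filter between the query and the plugin that drops lines which are not valid JSON objects. The sketch below assumes that the query emits one JSON object per line, which is not confirmed here; the function name valid_json_lines is hypothetical.

```python
import json
import sys
from typing import Iterable, Iterator


def valid_json_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only lines that parse as a JSON object, re-serialized on one line."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            doc = json.loads(line)
        except ValueError:
            print(f"skipping line {number}: not valid JSON", file=sys.stderr)
            continue
        if not isinstance(doc, dict):
            print(f"skipping line {number}: not a JSON object", file=sys.stderr)
            continue
        yield json.dumps(doc)


# Demonstration on an in-memory sample instead of real query output.
sample = [
    '{"url": "https://example.org", "id": "1", "title": "Example"}',
    "not json",
    "[1, 2]",
    "",
]
cleaned = list(valid_json_lines(sample))
print(len(cleaned), "of", len(sample), "lines kept")
```

In a standalone script, iterating over valid_json_lines(sys.stdin) and printing each result would allow the filter to sit in the pipe between owilix query less and the plugin call.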