Tutorial 99. Pushing Data to OpenSearch#

Danger

The tutorial is not yet fully working. There are still some parsing errors that could not be resolved via the command line.

Scenario

This tutorial explores the scenario where you have a list of URLs for which you aim to build an OpenSearch index (for further use).

We will conduct the following steps:

  1. Create a list of URLs

  2. Create a command to filter the URLs using owilix

  3. Create a command to push the filter results to an OpenSearch cluster using owilix

  4. Run the command.

Prerequisites

The tutorial focuses on using the command line under Linux or macOS. It also assumes that you have installed owilix and that the tool is working.

OWILIX Site filtering#

owilix supports the use of a list of URLs, provided as a CSV file, to query index shards.

First, we create a list of URLs; let's call it url-list.csv:

uni-passau.de
www.opensearchfoundation.org
https://www.blogger.com/blog

The URL list is parsed using the url-normalize package and decomposed into scheme, subdomain, domain, URL suffix, and path. We then match these elements against the corresponding fields in the parquet index.
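For illustration, this decomposition can be sketched with Python's standard library alone. Note that this is not owilix's implementation (which relies on the url-normalize package): the sketch below uses a naive last-label split instead of a public suffix list, so multi-part suffixes such as co.uk would be split incorrectly. The function name `decompose` is illustrative, not part of owilix.

```python
from urllib.parse import urlparse

def decompose(url: str) -> dict:
    """Naively split a URL into the fields matched against the parquet index.
    Illustration only: uses a simple last-label suffix split, not the
    public-suffix-aware parsing that owilix performs."""
    if "://" not in url:            # bare entries like 'uni-passau.de' get a default scheme
        url = "https://" + url
    parts = urlparse(url)
    labels = parts.netloc.split(".")
    suffix = labels[-1] if len(labels) > 1 else ""
    domain = labels[-2] if len(labels) > 1 else labels[0]
    subdomain = ".".join(labels[:-2])
    return {
        "url_scheme": parts.scheme,
        "url_subdomain": subdomain,
        "url_domain": domain,
        "url_suffix": suffix,
        "url_path": parts.path or "/",
    }

print(decompose("https://www.blogger.com/blog"))
# {'url_scheme': 'https', 'url_subdomain': 'www', 'url_domain': 'blogger',
#  'url_suffix': 'com', 'url_path': '/blog'}
```

This mirrors the decomposed entries shown in the query output further below.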

To see whether it is working, you can simply run:

owilix query sites --remote all:latest#20/collectionName=main "files=**/language=eng/*.parquet" "select=url,title,curlielabels_en" urls_file=url-list.csv

The command returns the text and fields of the index, presented comparably to the less command.

Note

owilix querying is done on a file basis. The index shard files are accessed over the network much like local files, which can be slow and subject to network errors. Consequently, the query process first collects the set of files selected (via the selector and the optional files parameter) and then conducts the query over those files. The more files you select, the longer the query will take.

Storing filtered results as JSON#

Working with text is good for manual inspection, but not suitable for further processing. The sites sub-command therefore also allows exporting the data as jsonl (one JSON entry per line).

> owilix query sites --remote all:latest#20/collectionName=main "files=**/language=eng/*.parquet" "select=url,title,curlielabels_en" urls_file=url-list.csv as_json=True json_file=url-list-result.json
Found 4 unique sites to query: [{'url_scheme': 'https', 'url_subdomain': 'www', 'url_domain': 'blogger', 'url_suffix': 'com', 'url_path': '/blog'}, {'url_scheme': 'https', 'url_subdomain': '', 'url_domain': 'com', 'url_suffix': '', 'url_path': '/'},
{'url_scheme': 'https', 'url_subdomain': 'www', 'url_domain': 'opensearchfoundation', 'url_suffix': 'org', 'url_path': '/'}, {'url_scheme': 'https', 'url_subdomain': '', 'url_domain': 'uni-passau', 'url_suffix': 'de', 'url_path': '/'}]
Fetching remote datasets for specifier up:2025-01-31#30/collectionName=main
Found 31 remote datasets
Fetching dataset files  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 fadda79e-d664-11ef-a926-0242ac150006/OWI-Open Web Index-main.owi@it4i-2025-01-18:2025-01-18
Found '22361' parquet files in 1 filesystems. Running site query.

The results are stored in the specified jsonl file.
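The jsonl file can then be consumed from any language with a JSON parser. A minimal Python sketch follows; the field names in each record depend on the select parameter you passed (here url, title, curlielabels_en), and the helper `load_jsonl` is illustrative, not part of owilix.

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read a JSON Lines file (one JSON object per line) into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:                    # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `for rec in load_jsonl("url-list-result.json"): print(rec["url"])` would list the matched URLs.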

Indexing Dataset Shards#

Indexing dataset shards combines the owilix plugin mechanism with the owilix query capability. We essentially run a less over the data (i.e. listing datasets) with a specific query, then pipe the output into the owilix OpenSearch plugin.

owilix query less --remote it4i:latest "select=url,id,title" as_json=True | owilix plugin owilix.plugins.push.opensearch.OpenSearchIndexer index owilix_test

The above command pipes the url, id and title fields into an OpenSearch instance defined via environment variables.
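The internals of the OpenSearchIndexer plugin are not shown here. Conceptually, though, each jsonl record piped in can be turned into one index action in OpenSearch's standard bulk (NDJSON) request format. The sketch below illustrates that mapping under that assumption; `to_bulk_payload` is a hypothetical helper, and the index name owilix_test matches the command above.

```python
import json

def to_bulk_payload(records: list[dict], index: str) -> str:
    """Render records as an OpenSearch _bulk request body (NDJSON):
    one action line followed by one document line per record."""
    lines = []
    for rec in records:
        action = {"index": {"_index": index}}
        if "id" in rec:                 # reuse the record id as the document id
            action["index"]["_id"] = rec["id"]
        lines.append(json.dumps(action))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"      # bulk bodies must end with a newline
```

Such a payload could be POSTed to the cluster's `_bulk` endpoint; the actual plugin handles connection details via the environment variables mentioned above.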

Shell Script#

In the scripts folder of the owilix source code you can find the shell script ours_index.sh, which we use to provide a simple title-based search engine in the Open Web Index Dashboard. You can take it as inspiration for developing your own import scripts.