Tutorial 3. Slicing Data with owilix#


owilix allows you to run queries over Parquet files in remote or local repositories using DuckDB. This means you can slice (i.e. select subsets of columns) and dice (i.e. select subsets of rows) the Parquet files of the index (i.e. not the CIFF index files themselves). This gives you a very powerful means of selecting data.

For example:

  • you can select all pages with url_suffix=at

  • you can select all pages containing shop in their url

  • you can select all pages allowed for gen_ai

  • you can select all pages with a certain curlie label (formerly called the domain label)

When you run a query, a new dataset is created, and you can specify the name of the collection in which this dataset should be placed. You can also append a query's results to an existing dataset, but note that data is overwritten at the directory level.

But now let's dig into some examples.

TL;DR

  1. Slicing can be done by running a SQL query against selected data shards and storing the results locally as new datasets. For example:

owilix query slice --remote lrz:2023-10-25#1 "where=WHERE url like '%impressum%' OR url like '%legal%' OR url like '%imprint%' OR url like '%terms%' OR url like '%privacy%' OR url like '%contact%' OR url like '%agreement%'" import_collection=legal
  2. We prepare specially sliced partitions under collection names. You can get a sub-index with only legal information on websites using

owilix remote pull all/collectionName=legal

  3. Afterwards, you can browse it or convert it to JSON (by setting as_json=True or as_json=False):

owilix query less --local all/collectionName=legal as_json=True select=url,title,domain_label "where=WHERE url like '%impressum%' or url like '%imprint%'" 

The data can also be downloaded as sample data, in which case you don’t need to authenticate.

Adding to an existing collection#

While the main index is partitioned into daily shards, collection indices might span a longer duration. To create such a larger collection, you can use the slice command and specify the internalID of a dataset. The slice command will then look for the collection with that internalID locally and, if it is found, add the new slice to it. Note that the provenance chain is kept in the metadata field provenance, and the changelog in the root of the directory will be updated.

owilix query slice --remote lrz:2023-10-27#2 "where=WHERE url like '%impressum%' OR url like '%legal%' OR url like '%imprint%' OR url like '%terms%' OR url like '%privacy%' OR url like '%contact%' OR url like '%agreement%'" import_collection=legal internalID=6adad19-02be-45ce-ae93-53a0d7a506f4

Caution

To keep the slicing process controlled, we do not recommend slicing over all datasets; use a maximum of two to four days instead. Since queries run directly over Parquet files, network failures might require retries, which would take a long time over a larger batch of data.

Note

Note that the dataset structure is kept the same and that files can potentially be overwritten. So when slicing datasets from the same days but different data centers, content might get overwritten.

Slice Provenance#

Slicing also records the provenance in the provenance metadata field. This allows you to check whether a certain collection has already integrated a main partition with the particular settings. When running a new slice, already integrated main partitions will be ignored.

If you encounter the following message, it means a dataset was not processed due to previous provenance:

Dataset with id 26adad19-02be-45ce-ae93-53a0d7a506f4 /OWI-Open Web Index-{collectionName}.{resourceType}@{dataCenter}-{startDate}:{endDate} already contains the same files and query in its provenance list. Skipping the following datasets ['OWI-Open Web Index-main.owi@lrz-2023-10-24', 'OWI-Open Web Index-main.owi@lrz-2023-10-25']

By setting ignore_provenance=True in the slice command, you can force the slice to run anyway.
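For example, combining the slice command shown above with this setting (the only addition here is ignore_provenance=True; the rest is unchanged) forces the already-integrated partitions to be re-sliced:

```shell
owilix query slice --remote lrz:2023-10-27#2 \
  "where=WHERE url like '%impressum%' OR url like '%legal%' OR url like '%imprint%'" \
  import_collection=legal internalID=6adad19-02be-45ce-ae93-53a0d7a506f4 \
  ignore_provenance=True
```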

Converting slices to json#

Both collection indices and the main index can be viewed using the owilix query less command and also converted into JSON.

So while

owilix query less --local all/collectionName=legal select=url,title,domain_label

would print nicely formatted entries to the console, the command

owilix query less --local all/collectionName=legal as_json=True where="WHERE url like '%impressum%' or url like '%imprint%'"

would print JSON, which can be piped into a file and used elsewhere.
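For example (assuming a standard Unix shell; the output file name is arbitrary), the JSON output can be redirected into a file for further processing:

```shell
owilix query less --local all/collectionName=legal as_json=True \
  where="WHERE url like '%impressum%' or url like '%imprint%'" > imprint_pages.json
```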