Tutorial 3. Slicing Data with owilix#
owilix allows you to run queries over Parquet files on remote or local repositories using DuckDB.
This means you can slice (i.e., select subsets of columns) and dice (i.e., select subsets of rows) the Parquet files of the index (i.e., not the index CIFF files themselves).
This gives you a very powerful means of selecting data.
For example:
- you can select all pages with url_suffix=at
- you can select all pages containing shop in their url
- you can select all pages allowed for gen_ai
- you can select all pages with a certain curlie label (which was previously called the domain label)
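Under the hood, such a selection is plain SQL executed by DuckDB over Parquet files. As a purely illustrative sketch (the path shard/*.parquet is hypothetical and not the actual index layout), the following direct DuckDB CLI call combines a column slice with a row dice, which is the pattern owilix automates for you:
duckdb -c "SELECT url, title FROM read_parquet('shard/*.parquet') WHERE url LIKE '%shop%' LIMIT 10"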
Running a query creates a new dataset, and you can specify the name of the collection in which this dataset should be created. You can also append a query to an existing dataset, but note that data is overwritten at the directory level.
But now let's dig into some examples.
TL;DR
Slicing can be done by running a SQL query against selected data shards and storing the results locally as new datasets. For example:
owilix query slice --remote lrz:2023-10-25#1 "where=WHERE url like '%impressum%' OR url like '%legal%' OR url like '%imprint%' OR url like '%terms%' OR url like '%privacy%' OR url like '%contact%' OR url like '%agreement%'" import_collection=legal
We prepare specially sliced partitions under the name of collections.
You can get a sub-index with only legal information on websites using
owilix remote pull all/collectionName=legal
Afterwards, you can browse it or convert it to JSON (setting as_json=True or False):
owilix query less --local all/collectionName=legal as_json=True select=url,title,domain_label "where=WHERE url like '%impressum%' or url like '%imprint%'"
The data can also be downloaded as sample data, so you don’t need to authenticate.
“Give me all legal pages” as a Concrete Example#
The following query selects all pages from the lrz data center in the range 2023-10-24 to 2023-10-25 where the url contains either impressum or legal or imprint or terms or privacy or contact or agreement. Such urls usually contain the legal information of a website, and thus we import them into a collection called legal.
owilix query slice --remote lrz:2023-10-25#1 "where=WHERE url like '%impressum%' OR url like '%legal%' OR url like '%imprint%' OR url like '%terms%' OR url like '%privacy%' OR url like '%contact%' OR url like '%agreement%'" import_collection=legal
Issuing this command runs the query and displays progress bars so you can track its progress.
Since we consider this quite an important subset of the web, we provide such a “legal” collection pre-computed.
Adding to an existing collection#
While the main index is partitioned into daily shards, collection indices might span a longer duration.
To create such a larger collection, you can use the slice command and specify the internalID of a dataset.
The slice command will then look for the collection with that internalID locally and, if it is found, add the new slice to it.
Note that the provenance chain is kept in the metadata field provenance, and the changelog in the root of the directory will be updated.
owilix query slice --remote lrz:2023-10-27#2 "where=WHERE url like '%impressum%' OR url like '%legal%' OR url like '%imprint%' OR url like '%terms%' OR url like '%privacy%' OR url like '%contact%' OR url like '%agreement%'" import_collection=legal internalID=6adad19-02be-45ce-ae93-53a0d7a506f4
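To extend the same collection with a further day later on, you would run the slice command again with a new shard but the identical internalID, for example (the shard specifier and the abbreviated where clause here are only for illustration):
owilix query slice --remote lrz:2023-10-28#1 "where=WHERE url like '%impressum%' OR url like '%legal%'" import_collection=legal internalID=6adad19-02be-45ce-ae93-53a0d7a506f4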
Caution
To keep the slicing process controlled, we do not recommend slicing over all datasets; use a maximum of two to four days instead. Since queries are run directly over Parquet files, network failures might require retries, which can take long over a larger batch of data.
Note
The dataset structure is kept the same, and files can potentially be overwritten. So when slicing datasets from the same days but different data centers, content might get overwritten.
Slice Provenance#
Slicing also keeps the provenance in the provenance metadata field.
This allows checking whether a certain collection has already integrated a main partition with the particular settings.
When running a new slice, already integrated main partitions are ignored.
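If you want to check yourself which main partitions a collection has already integrated, you can inspect the provenance metadata field. Assuming the metadata is stored as JSON inside the dataset directory (the file name metadata.json is an assumption and may differ in your layout), a quick look could be:
jq '.provenance' path/to/dataset/metadata.json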
If you encounter the following message, it means a dataset was not processed due to its previous provenance:
Dataset with id 26adad19-02be-45ce-ae93-53a0d7a506f4 /OWI-Open Web Index-{collectionName}.{resourceType}@{dataCenter}-{startDate}:{endDate} already contains the same files and query in its provenance list. Skipping the following datasets ['OWI-Open Web Index-main.owi@lrz-2023-10-24', 'OWI-Open Web Index-main.owi@lrz-2023-10-25']
By setting ignore_provenance=True in the slice command, you can force the slice to run anyway.
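For example, re-running the command from above with the flag appended (all other parameters unchanged, where clause abbreviated for readability):
owilix query slice --remote lrz:2023-10-27#2 "where=WHERE url like '%impressum%' OR url like '%legal%'" import_collection=legal internalID=6adad19-02be-45ce-ae93-53a0d7a506f4 ignore_provenance=True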
Converting slices to JSON#
Both collection indices and the main index can be shown using the owilix query less command and also converted into JSON.
So while
owilix query less --local all/collectionName=legal select=url,title,domain_label
would print nicely formatted entries to the console, the command
owilix query less --local all/collectionName=legal as_json=True where="WHERE url like '%impressum%' or url like '%imprint%'"
would print JSON, which can be piped into a file and used elsewhere.
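For example, the following pipeline (the output file name and the jq pretty-print step are just one possible way to consume the result; the exact JSON layout may depend on your owilix version) stores the slice as a file and inspects it:
owilix query less --local all/collectionName=legal as_json=True select=url,title "where=WHERE url like '%impressum%'" > legal.json
jq '.' legal.json | head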