Tutorial 1. Data Download with owilix#

The project provides the owilix command line tool for accessing OWI data. owilix is best understood as git for the OWI, i.e. it lets you pull and push data. This tutorial demonstrates how owilix can be used to access and use the Open Web Index.

OWI Datasets#

The Open Web Index is provided as a set of datasets, published on a daily basis per data center.

The following figure depicts the basic components of a dataset, which is essentially a hierarchical partitioning by date and language.

  • parquet files contain the metadata obtained from resiliparse/resilipipe as well as the plain text.

  • ciff files contain the inverted index created via the OWI Indexer

owilix supports synchronising remote datasets with local ones, similar to git for source code.

Installing owilix#

owilix depends on a large number of libraries, so we recommend installing it into a separate Python environment managed by pyenv or conda. The CLI also requires Python 3.11 and either Linux or macOS (on Windows, WSL might work but has not been tested).

Usually running the following lines should give you a clean environment (when using conda):

# Create a new environment
conda create -n owi pip python=3.11
conda activate owi
# Install required packages
pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
# Verify installation
owilix --help

Note that you always need to activate the owi environment when using owilix (i.e. conda activate owi).

First check if everything works#

owilix supports simple listing of datasets. After installation, it is recommended to run a simple owilix remote ls all command to see whether everything is working.

In case of problems, you can use

  • owilix logs errors to show the error logs that occurred,

  • owilix clean to clean potentially problematic files and

  • owilix remote doctor to check connection details

The columns to be displayed and the sort order can be configured via profiles and CLI parameters, but that is beyond the scope of this tutorial.

Looking at public datasets in the LEXIS Portal#

To check that you have access to the public datasets in the LEXIS Portal (and thus rule out any permission problems), navigate to the DataSets/Public menu in the LEXIS Portal, where you should see the list of public datasets.

Under the menu item Projects you can request access to the “openwebsearch” project so that you can also access WARC datasets. However, this also requires engaging with us via the community platform or contacting us via email.

Listing Datasets and Specifiers#

Specifying Sets of Datasets#

owilix uses so-called dataset specifiers to select sets of datasets. For example, all specifies all available datasets, while all:latest gives you the latest dataset available in each data center.

In general, specifiers have the following form: {datacenter|all}[:{YYYY-MM-DD|latest}][#{duration}] {key1=value1;key2=value2} where

  • datacenter is the data center abbreviation (currently it4i, csc or lrz), or all to address all data centers.

  • date is either a concrete day (YYYY-MM-DD) or latest for the most recent available dataset.

  • duration is the number of days backward from the given day.

  • key=value pairs act as filters for selecting datasets, i.e. all key=value pairs need to match for a dataset to be selected.
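To make the grammar concrete, the following Python sketch parses a specifier into its parts. The function name and the regular expression are purely illustrative; this is not owilix's actual parser.

```python
import re

# Illustrative grammar for dataset specifiers (not owilix's own code):
#   {datacenter|all}[:{YYYY-MM-DD|latest}][#{duration}] [key1=value1;key2=value2]
SPEC_RE = re.compile(
    r"^(?P<datacenter>\w+)"                       # e.g. it4i, csc, lrz or all
    r"(?::(?P<date>\d{4}-\d{2}-\d{2}|latest))?"   # optional date or 'latest'
    r"(?:#(?P<duration>\d+))?$"                   # optional days backward
)

def parse_specifier(spec: str) -> dict:
    """Split a specifier like 'it4i:latest#6 files=**/language=deu/*'."""
    head, *filter_parts = spec.split(" ", 1)
    m = SPEC_RE.match(head)
    if m is None:
        raise ValueError(f"invalid specifier: {spec!r}")
    filters = dict(
        pair.split("=", 1)
        for part in filter_parts
        for pair in part.split(";")
    )
    return {**m.groupdict(), "filters": filters}

print(parse_specifier("all:latest#6"))
print(parse_specifier("it4i:latest files=**/language=deu/*"))
```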

Listing Datasets#

First, we need to identify suitable datasets.

Let's inspect the latest datasets together with the six days before them (a seven-day window):

owilix remote ls all:latest#6 

which gives us something like the following: approximately 600 GB of data across two data centers.

To be more specific, we pick only the latest dataset from the data center it4i and restrict it to the files available for German (language code deu):

owilix remote ls it4i:latest files=**/language=deu/*
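The files filter is a glob pattern matched against the partitioned paths inside a dataset. As an illustration (the paths below are invented and this is not owilix's matching code), Python's fnmatch shows which files a language filter would select:

```python
from fnmatch import fnmatch

# Hypothetical paths inside a dataset, following the date/language
# partitioning described above (invented for illustration):
paths = [
    "date=2025-01-01/language=deu/part-000.parquet",
    "date=2025-01-01/language=eng/part-000.parquet",
    "date=2025-01-01/language=deu/part-001.parquet",
]

pattern = "**/language=deu/*"

# Note: fnmatch's '*' also crosses '/' separators, which is close enough
# here to demonstrate the selection.
selected = [p for p in paths if fnmatch(p, pattern)]
print(selected)
```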

Pulling datasets#

Downloading datasets is easy: instead of the ls command, use the pull command.

owilix remote pull it4i:latest files=**/language=deu/*

Note that pulling is file-based and synchronises with already existing files. So if we pull again, but now download all files, the deu files will not be transferred again (unless you specify overwrite). This also means that if the download gets interrupted, it can be resumed.
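The skip-existing behaviour can be pictured with a small sketch. This is illustrative logic only, not owilix's implementation; the function name and directory layout are invented:

```python
import shutil
from pathlib import Path

def sync_files(remote_dir: Path, local_dir: Path, overwrite: bool = False) -> list[str]:
    """Copy files from remote_dir to local_dir, skipping ones already present."""
    copied = []
    for src in sorted(remote_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = local_dir / src.relative_to(remote_dir)
        if dst.exists() and not overwrite:
            continue  # already pulled earlier -> no transfer needed
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(str(dst.relative_to(local_dir)))
    return copied
```

Because the decision is made per file, re-running the same pull after an interruption only transfers what is still missing.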


You can use owilix remote help pull to see which other options are available.

Consuming datasets#

Consuming datasets depends on your purpose and is not directly part of owilix.

You can either use the index and parquet files via our MOSAIC search engine or directly access the files in the download directory (note that the default is ~/.owi/public/main/).

Querying data sets#

One possibility for accessing the data via owilix is to query it using owilix query.

owilix query less --local all:latest select=url,title,domain_label "where=url_suffix='at'"

lets you browse the results similarly to the Linux command line tool “less”, but less fancy. Here we used a more specific subset, namely all .at domains, and we only display url, title and domain_label. Feel free to play around and explore the index.

Note that in the query case we need to specify which repositories the specifier should be applied to, i.e. local or remote, so that we can also run the query directly over remote datasets.

Under the hood, owilix uses DuckDB to execute SQL over the parquet files, which can be either local or remote.