OWI Access via LEXIS and py4lexis#

TL;DR

  • The LEXIS Portal manages dataset shards (and workflows)

  • LEXIS contains a central dataset catalog and manages the decentralised repositories holding the data

  • Preprocessed and index data are publicly available, while access to WARC data is restricted (contact us).

  • Access to all dataset shards (including public ones) requires authentication via EUDAT for legal reasons

  • The LEXIS Portal supports HTTP Download (which is not recommended)

  • The Python library py4lexis provides a command-line interface as well as a Python API for access, but leaves dataset management to you.

The LEXIS Portal is the central portal for managing the datasets hosted by the different partners, so it is a natural first place to start when looking for daily index shards. Due to the potential size of index shards, the download has been realised as a multi-step process, outlined as follows.

Authentication#

LEXIS requires authenticated users even for public downloads. For your convenience, we integrate the EUDAT authentication provider, which supports login from most European research institutions as well as common social logins such as ORCID, GitHub, etc. Please note that by logging into LEXIS and downloading the index you accept the current Open Web Index License (OWIL).

Listing Data Set Details#

After identifying the relevant dataset, you can inspect its details and navigate its structure through the blue button and then “File List”. You can also download individual files by right-clicking a file and choosing either download or open.

Of course, this step is optional.

Preparing the Download#

Regardless of whether you download a full dataset or a single file, the download needs to be prepared first. To do so, select the dataset (or file) and click “download”. You will be notified that the download is being prepared.

Download#

After the preparation has been successful, the download can be initiated. You can do this from the top of the screen with the download arrow.

Preparation will take some time. You can find the current status of the preparation under Dashboard/Downloads.

Download progress:

After the download, you can find the file in your download folder as download.gz.

LEXIS Platform Documentation#

For further details and documentation, please see the LEXIS Documentation.

Scripting Downloads with py4lexis#

Manual download via the LEXIS portal is mainly useful for fetching a few example datasets. It is also slow, because the data is not transferred in parallel but goes through the central LEXIS portal. We therefore recommend using scripts for regular download tasks.

For this purpose, the LEXIS Platform offers a Python client called py4lexis for faster, parallel download and command-line (CLI) based dataset management.

We give some examples here on how to use py4lexis with the Open Web Index, but link to the online documentation for further details.

For the examples here we assume you have a Linux or macOS shell with Python 3.10 available.

Starting a session#

You need to log in to a session, which then gives you an inner console that is authenticated against the LEXIS platform.

You can do this via

python -m py4lexis.cli session login-url

Listing datasets and content of a dataset#

The get-all-datasets command allows you to list all datasets and filter them by project, zone, etc. Please note that LEXIS also hosts other projects besides OpenWebSearch, so you might want to apply a filter on the project. You can find all options by using get-all-datasets --help

> python -m py4lexis.cli datasets get-all-datasets --filter-project openwebsearch
Welcome to the Py4Lexis!
You have been successfully logged in LEXIS session.
Retrieving data of the datasets...
Converting HTTP content from JSON to pandas Dataframe...
Data of the datasets successfully retrieved (and converted)....
Formatting pandas DataFrame into ASCII table...
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| Title                                                    | Access   | Project        | Zone        | InternalID                           | CreationDate        |
+==========================================================+==========+================+=============+======================================+=====================+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/12/26        | project  | openwebsearch  | OWSLRZZONE  | 03d73c63-9524-4ecb-8b6e-f7f59b6a9676 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2024/1/20         | project  | openwebsearch  | OWSLRZZONE  | 043344e1-13ad-4ede-9a19-62ad918b3816 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/11/14        | project  | openwebsearch  | OWSLRZZONE  | 078d3a3f-fb33-4ce6-9142-2f83543cd231 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2024/1/21         | project  | openwebsearch  | OWSLRZZONE  | 07a3682f-bb3d-458f-8b81-06a9e71df96e | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/11/13        | project  | openwebsearch  | OWSLRZZONE  | 14a4bb5d-e6d3-4bc8-a00c-14c047950a78 | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+
| The Open Web Index (Raw Data)@BAdW-LRZ 2023/11/21        | project  | openwebsearch  | OWSLRZZONE  | 1e0c1a10-cf36-4d2c-ad2d-40dd476e8c9c | 2024-04-12 07:30:54 |
+----------------------------------------------------------+----------+----------------+-------------+--------------------------------------+---------------------+

The access level determines whether you need to be a member of the openwebsearch project (Access = project) or not. Note that all raw crawl data has access level project and is thus only available upon request. Preprocessed and indexed datasets are available to every authenticated user.

py4lexis does not support project-specific filtering, i.e. filtering based on the metadata of our index files. However, you can use standard tools like grep to filter on text.

> python -m py4lexis.cli datasets get-all-datasets|grep OWI | grep 2024-03 
| TOWI-The Open Web Index-2023-12-25@lrz-eng-multi         | public   | openwebsearch  | IT4ILexisV2 | f3ffccb4-dc63-11ee-a201-0242c0a87004 | 2024-03-07 09:20:36 |
| TOWI-The Open Web Index-2023-12-25@lrz-eng               | public   | openwebsearch  | IT4ILexisV2 | c46e4d62-dc5f-11ee-ad7e-0242c0a87004 | 2024-03-07 08:50:38 |
| OWI-Open Web Index-legal.owip@it4i-2024-03-12:2024-03-31 | public   | openwebsearch  | IT4ILexisV2 | 0dac12be-52f5-11ef-a60f-0242c0a81003 | 2024-08-05 06:36:32 |
| OWI-Open Web Index-main.owi@csc-2024-03-01               | public   | openwebsearch  | IT4ILexisV2 | c378f8da-5680-11ef-aad9-0242ac130005 | 2024-08-09 19:21:15 |
| OWI-Open Web Index-main.owi@csc-2024-03-05               | public   | openwebsearch  | IT4ILexisV2 | 439467a0-69f7-11ef-aad9-0242ac130005 | 2024-09-03 13:49:52 |
| OWI-Open Web Index-main.owi@csc-2024-03-02               | public   | openwebsearch  | IT4ILexisV2 | 4389c08a-691f-11ef-aad9-0242ac130005 | 2024-09-02 12:03:31 |
| OWI-Open Web Index-main.owi@csc-2024-03-04               | public   | openwebsearch  | IT4ILexisV2 | 3c19d116-69d6-11ef-aad9-0242ac130005 | 2024-09-03 09:56:45 |
| OWI-Open Web Index-main.owi@csc-2024-03-03               | public   | openwebsearch  | IT4ILexisV2 | 2cd1d28e-6941-11ef-aad9-0242ac130005 | 2024-09-02 16:06:12 |
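As an alternative to grep, the table can be filtered per column in a small script. The following is only a sketch: it assumes you have captured the unfiltered output of get-all-datasets (including the header row) as text, and the helper names parse_table and select are hypothetical, not part of py4lexis.

```python
def parse_table(text):
    """Parse the ASCII table printed by the CLI into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Header and data rows start and end with '|'; border rows start with '+'.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first '|' row is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def select(rows, title_contains="", access=None):
    """Keep rows whose Title contains a substring and, optionally, match an access level."""
    return [
        row for row in rows
        if title_contains in row["Title"]
        and (access is None or row["Access"] == access)
    ]


# Shortened sample built from rows shown in this section.
sample = """\
+-------+--------+---------+------+------------+--------------+
| Title | Access | Project | Zone | InternalID | CreationDate |
+=======+========+=========+======+============+==============+
| OWI-Open Web Index-main.owi@csc-2024-03-01 | public | openwebsearch | IT4ILexisV2 | c378f8da-5680-11ef-aad9-0242ac130005 | 2024-08-09 19:21:15 |
| TOWI-The Open Web Index-2023-12-25@lrz-eng | public | openwebsearch | IT4ILexisV2 | c46e4d62-dc5f-11ee-ad7e-0242c0a87004 | 2024-03-07 08:50:38 |
| The Open Web Index (Raw Data)@BAdW-LRZ 2024/1/20 | project | openwebsearch | OWSLRZZONE | 043344e1-13ad-4ede-9a19-62ad918b3816 | 2024-04-12 07:30:54 |
"""

for row in select(parse_table(sample), "2024-03", access="public"):
    print(row["InternalID"], row["Title"])
```

Unlike grep, which matches the pattern anywhere in the line (the two TOWI rows above match 2024-03 only through their CreationDate), this filter looks at the Title column alone.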

Working with datasets#

Every dataset in LEXIS is identified by a UUID, referred to as InternalID. So if you work with a dataset, you have to specify the InternalID of that dataset. Working with groups of datasets is not supported yet, but you can use owilix, the OpenWebSearch.eu CLI that builds on py4lexis (see next section). Beyond the InternalID, py4lexis also requires the access level (mostly public) and the project name openwebsearch for dataset access.
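When scripting, it can help to assemble the three required arguments in one place. The sketch below only builds the argument list for the dataset commands shown in this section. The helper name dataset_command is hypothetical, and actually running the result (e.g. with subprocess.run) presumes an authenticated session.

```python
import uuid


def dataset_command(action, internal_id, access="public", project="openwebsearch"):
    """Build the argument list for a py4lexis dataset command.

    `action` is a subcommand from this page, e.g. "get-content-of-dataset"
    or "download-dataset".
    """
    # Fail early on a malformed InternalID: uuid.UUID raises ValueError.
    uuid.UUID(internal_id)
    return ["python", "-m", "py4lexis.cli", "datasets", action,
            internal_id, access, project]


cmd = dataset_command("get-content-of-dataset",
                      "2cd1d28e-6941-11ef-aad9-0242ac130005")
print(" ".join(cmd))
```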

Listing content of datasets

The following command lists the content of a dataset

> python -m py4lexis.cli datasets get-content-of-dataset 2cd1d28e-6941-11ef-aad9-0242ac130005 public openwebsearch
Welcome to the Py4Lexis!
You have been successfully logged in LEXIS session.
Retrieving data of files in the dataset...
Converting HTTP content from JSON to pandas Dataframe...
Content of the dataset was successfully retrieved (and converted)...
Formatting pandas DataFrame into ASCII table...
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| Dir/File-name       | Path                                                        | Type   |       Size | CreateTime          | Checksum   |
+=====================+=============================================================+========+============+=====================+============+
| index.ciff.gz       | year=2024/month=3/day=3/language=aar/index.ciff.gz          | file   |      43281 | 2024-09-02T15:50:13 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=aar/metadata_0.parquet     | file   |     116996 | 2024-09-02T15:50:12 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=abc/index.ciff.gz          | file   |       1451 | 2024-09-02T15:57:24 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=abc/metadata_0.parquet     | file   |      18836 | 2024-09-02T15:57:24 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=abk/index.ciff.gz          | file   |      14041 | 2024-09-02T15:48:03 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=abk/metadata_0.parquet     | file   |      54112 | 2024-09-02T15:48:03 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=ace/index.ciff.gz          | file   |       1438 | 2024-09-02T15:50:45 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=ace/metadata_0.parquet     | file   |      25103 | 2024-09-02T15:50:44 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=afr/index.ciff.gz          | file   |    3176754 | 2024-09-02T15:48:05 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=afr/metadata_0.parquet     | file   |   12290453 | 2024-09-02T15:48:04 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=aka/index.ciff.gz          | file   |       5236 | 2024-09-02T15:59:30 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| metadata_0.parquet  | year=2024/month=3/day=3/language=aka/metadata_0.parquet     | file   |      27690 | 2024-09-02T15:59:29 | None       |
+---------------------+-------------------------------------------------------------+--------+------------+---------------------+------------+
| index.ciff.gz       | year=2024/month=3/day=3/language=all/index.ciff.gz          | file   |      61807 | 2024-09-02T15:59:25 | None       |
....
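The paths in the listing follow a key=value directory layout (year, month, day, language). As a minimal sketch, assuming you have collected the Path column as plain strings, the partitions can be split off to pick out the files for one language. The helper names parse_partitions and files_for_language are hypothetical.

```python
def parse_partitions(path):
    """Split a listing path into its key=value partitions and the file name."""
    *dirs, filename = path.split("/")
    partitions = dict(part.split("=", 1) for part in dirs if "=" in part)
    return partitions, filename


def files_for_language(paths, language):
    """Return the paths whose language partition equals `language`."""
    return [p for p in paths if parse_partitions(p)[0].get("language") == language]


# Paths taken from the listing above.
paths = [
    "year=2024/month=3/day=3/language=aar/index.ciff.gz",
    "year=2024/month=3/day=3/language=aar/metadata_0.parquet",
    "year=2024/month=3/day=3/language=afr/index.ciff.gz",
    "year=2024/month=3/day=3/language=afr/metadata_0.parquet",
]

print(files_for_language(paths, "afr"))
```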

Downloading datasets

Downloading works similarly to listing the content of a dataset, just with the download-dataset command. Please note that, as with the portal, py4lexis also needs to prepare the dataset and then downloads it via HTTP, which can be slow.

python -m py4lexis.cli datasets download-dataset 2cd1d28e-6941-11ef-aad9-0242ac130005 public openwebsearch
Welcome to the Py4Lexis!
You have been successfully logged in LEXIS session.
Submitting download request on server...
Download submitted!
Checking the status of download request...
Download request not ready yet, 200/0 retries
Download request not ready yet, 200/1 retries
Download request not ready yet, 200/2 retries

Preparation currently takes quite some time, especially for large datasets. However, we are working on a more efficient way to download.
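Since each download-dataset call handles a single InternalID, a regular download task over several datasets can be scripted as a loop. This is only a sketch under assumptions: the function name download_all is hypothetical, it presumes an authenticated session, and it assumes the CLI signals failure through a non-zero exit code, which is not confirmed here.

```python
import subprocess


def download_all(internal_ids, access="public", project="openwebsearch",
                 run=subprocess.run):
    """Run download-dataset once per InternalID and return the IDs that failed.

    `run` is injectable so the loop can be tried out without calling the CLI.
    """
    failed = []
    for internal_id in internal_ids:
        cmd = ["python", "-m", "py4lexis.cli", "datasets", "download-dataset",
               internal_id, access, project]
        result = run(cmd)
        # Assumption: a non-zero exit code means the download did not succeed.
        if result.returncode != 0:
            failed.append(internal_id)
    return failed


# Example call (not executed here):
# failed = download_all(["2cd1d28e-6941-11ef-aad9-0242ac130005",
#                        "c378f8da-5680-11ef-aad9-0242ac130005"])
```

The datasets are processed one after another, so the preparation time of each dataset adds up.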