Data and Data Sets#

The Open Web Index (TOWI) will be published as daily snapshots per data center containing:

  • The index in CIFF format

  • Metadata fields accompanying the index, in the Parquet format produced by our preprocessing pipeline

  • Auxiliary files for future developments (e.g. additional metadata, dense vector embeddings)

Naming Conventions and Metadata#

The datasets are hosted on the LEXIS Platform with a rich set of metadata fields that follow these naming conventions (to ease search):

  • Title: TOWI-The Open Web Index-<year>-<month>-<day>@<datacenter>-<language>

  • Creator/Contributor/Publisher/Owner

    • Datasets published within the OpenWebSearch.eu project are published through a joint effort of all partners.

    • We thus refer to all partners of the consortium as The OpenWebSearch.eu Consortium.

  • Rights / License: the datasets are licensed under the Open Web Index License (OWIL). Please note that the license will be continuously updated.

  • Resource Type: Dataset, with the special subtype Open Web Index V1.0

  • Filename: TOWI-<year>-<month>-<day>-<datacenter>-<language>-<form>.parquet, with the following placeholders:

    • <year>: the year of the snapshot

    • <month>: the month of the snapshot

    • <day>: the day of the snapshot

    • <datacenter>: the data center where the snapshot was taken:

      • lrz: Leibniz Supercomputing Centre

      • it4i: IT4Innovations

      • csc: CSC

    • <language>: the language of the snapshot:

      • all if all available languages are included

      • otherwise the three-character ISO language code

    • <form>: the form of the file, i.e. either `single` for a single file or `multi` for multiple files
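The naming convention above can be sketched in a few lines of Python. This is a minimal illustration, not part of any official tooling: the helper names (`towi_filename`, `towi_title`, `parse_towi_filename`) are hypothetical, and the sketch assumes that month and day are zero-padded to two digits and that language codes are lowercase, neither of which is stated explicitly above.

```python
import re
from datetime import date

# Data-center codes and file forms as listed in the naming convention.
DATACENTERS = {"lrz", "it4i", "csc"}
FORMS = {"single", "multi"}

# Assumption: four-digit year, zero-padded month/day, lowercase 3-letter language.
FILENAME_RE = re.compile(
    r"^TOWI-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-(?P<datacenter>[a-z0-9]+)-(?P<language>[a-z]{3})"
    r"-(?P<form>single|multi)\.parquet$"
)


def _check(datacenter: str, language: str) -> None:
    """Validate the placeholders shared by titles and filenames."""
    if datacenter not in DATACENTERS:
        raise ValueError(f"unknown data center: {datacenter!r}")
    # "all" and three-character language codes are both three lowercase letters.
    if not re.fullmatch(r"[a-z]{3}", language):
        raise ValueError(f"unexpected language code: {language!r}")


def towi_filename(day: date, datacenter: str,
                  language: str = "all", form: str = "single") -> str:
    """Build a metadata filename following the documented pattern."""
    _check(datacenter, language)
    if form not in FORMS:
        raise ValueError(f"unknown form: {form!r}")
    return f"TOWI-{day:%Y-%m-%d}-{datacenter}-{language}-{form}.parquet"


def towi_title(day: date, datacenter: str, language: str = "all") -> str:
    """Build a dataset title following the documented pattern."""
    _check(datacenter, language)
    return f"TOWI-The Open Web Index-{day:%Y-%m-%d}@{datacenter}-{language}"


def parse_towi_filename(name: str) -> dict:
    """Split a filename back into its placeholders."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a TOWI filename: {name!r}")
    parts = match.groupdict()
    return {
        "date": date(int(parts["year"]), int(parts["month"]), int(parts["day"])),
        "datacenter": parts["datacenter"],
        "language": parts["language"],
        "form": parts["form"],
    }


name = towi_filename(date(2024, 3, 10), "lrz", "deu", "multi")
print(name)  # TOWI-2024-03-10-lrz-deu-multi.parquet
print(parse_towi_filename(name)["datacenter"])  # lrz
```

Such helpers can be handy for building the search string for a given snapshot or for filtering a file list by data center or language. If the real files use a different padding or casing, the regular expression needs to be adjusted accordingly.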

The files are provided in ta

Download Facilities#

Downloading datasets is supported via the LEXIS Portal, either using the portal app or via Py4Lexis. Both require a login via B2ACCESS.

The LEXIS Platform supports downloading full datasets or single files.

Download via the LEXIS Platform#

  1. Log in to the LEXIS Portal. If you don’t have an account, you can create one via B2ACCESS. B2ACCESS supports different identity providers (mostly from Europe and academia).

  2. Navigate to Data Sets --> Public and select the dataset (e.g. search for TOWI, sort the list, etc.).

  3. Click Details (Blue Button) to get more information about the dataset.

  4. Click Download to download the dataset, or go to File List, select a file, and right-click to select Download.

  5. The download will be prepared. Wait until the download button at the top (arrow-down) shows the dataset, then download it from there.

  6. You can watch the status of the operation by going to Dashboard --> Data Operations --> Downloads. You can also download the dataset from the list there once preparation is finished.

Download via Py4Lexis#

Py4Lexis is a Python client for the LEXIS Platform. It allows you to download data from the LEXIS Platform directly from Python.

We will soon release a library and documentation on how to use Py4Lexis.

Offload Facilities#

Via the , we also support more fine-grained filtering and the offloading of data, i.e. data will be pushed from our data centers to your data storage systems.

We currently support OpenSearch, Elasticsearch and S3 as target storage systems.

Changes#

  • 2024-03-10: release of first data example