Data and Data Sets#
The Open Web Index (TOWI) will be published as daily snapshots per data center containing:
The index in CIFF Format
Metadata fields accompanying the index in the parquet format, coming out of our preprocessing pipeline
Auxiliary files for future developments (e.g. additional metadata, dense vector embeddings)
Naming Conventions and Metadata#
The data sets are hosted on the LEXIS Platform with a rich set of metadata fields, which follow these naming conventions (to ease search):
Title:
TOWI-The Open Web Index-<year>-<month>-<day>@<datacenter>-<language>
Creator/Contributor/Publisher/Owner:
Datasets published within the OpenWebSearch.eu project are the result of a joint effort of all partners. We thus refer to all partners of the consortium as The OpenWebSearch.eu Consortium.
Rights / License: the datasets are licensed under the Open Web Index License (OWIL). Please note that the license will be continuously updated.
Resource Type:
Dataset, with the special subtype Open Web Index V1.0
Filename:
TOWI-<year>-<month>-<day>-<datacenter>-<language>-<form>.parquet
Placeholders:
<year>: the year of the snapshot
<month>: the month of the snapshot
<day>: the day of the snapshot
<datacenter>: the data center where the snapshot was taken:
lrz: Leibniz Supercomputing Centre
it4i: IT4Innovations
csc: CSC
<language>: all if all available languages are included, otherwise the language code in the three-character ISO format
<form>: the form of the file, i.e. either `single` for a single file or `multi` for multiple files
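The naming conventions above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function names (`build_title`, `build_filename`, `parse_filename`) are hypothetical, and the sketch assumes zero-padded months and days as well as lowercase language codes, neither of which is stated explicitly in the conventions.

```python
import re
from datetime import date

# Data center codes and file forms listed in the naming convention.
DATACENTERS = ("lrz", "it4i", "csc")
FORMS = ("single", "multi")

# Assumptions (not confirmed by the text): month and day are zero-padded,
# and language codes are lowercase ("all" or a three-character code).
FILENAME_RE = re.compile(
    r"TOWI-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-(?P<datacenter>lrz|it4i|csc)"
    r"-(?P<language>[a-z]{3})"
    r"-(?P<form>single|multi)\.parquet"
)


def _check(datacenter: str, language: str) -> None:
    """Reject values that do not fit the documented placeholders."""
    if datacenter not in DATACENTERS:
        raise ValueError(f"unknown data center: {datacenter!r}")
    if not re.fullmatch(r"[a-z]{3}", language):
        raise ValueError(f"expected 'all' or a three-character code: {language!r}")


def build_title(snapshot: date, datacenter: str, language: str = "all") -> str:
    """Build the dataset title (note the '@' before the data center)."""
    _check(datacenter, language)
    return f"TOWI-The Open Web Index-{snapshot:%Y-%m-%d}@{datacenter}-{language}"


def build_filename(
    snapshot: date, datacenter: str, language: str = "all", form: str = "single"
) -> str:
    """Build a parquet file name following the documented pattern."""
    _check(datacenter, language)
    if form not in FORMS:
        raise ValueError(f"form must be one of {FORMS}: {form!r}")
    return f"TOWI-{snapshot:%Y-%m-%d}-{datacenter}-{language}-{form}.parquet"


def parse_filename(name: str) -> dict:
    """Split a file name back into its placeholder values."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a TOWI parquet file name: {name!r}")
    parts = match.groupdict()
    return {
        "date": date(int(parts["year"]), int(parts["month"]), int(parts["day"])),
        "datacenter": parts["datacenter"],
        "language": parts["language"],
        "form": parts["form"],
    }
```

Under these assumptions, `build_filename(date(2024, 3, 10), "lrz")` returns `TOWI-2024-03-10-lrz-all-single.parquet`, and `parse_filename` recovers the date, data center, language, and form from such a name, which can help when filtering a file list by snapshot day or data center.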
The files are provided in ta
Download Facilities#
Datasets can be downloaded via the LEXIS Portal, either using the portal app or via Py4Lexis. Both require a login via B2ACCESS.
The LEXIS Platform supports downloading full datasets or single files.
Download via the LEXIS Platform#
Log in to the LEXIS Portal. If you don’t have an account, you can create one via B2ACCESS. B2ACCESS supports different identity providers (mostly from Europe and academia).
Navigate to Data Sets --> Public and select the dataset (e.g. search for TOWI, sort, etc.).
Click Details (blue button) to get more information about the dataset.
Click Download to download the dataset, or go to File List, select a file, and right-click to select Download.
The download will be prepared. To download it, you need to wait until the download button at the top (arrow-down) shows the dataset.
You can watch the status of the operation by going to Dashboard --> Data Operations --> Downloads. You can also download the dataset from the list there once preparation is finished.
Download via Py4Lexis #
Py4Lexis is a Python client for the LEXIS Platform. It allows you to download data from the LEXIS Platform directly from Python.
We will soon release a library and documentation on how to use Py4Lexis.
Offload Facilities#
Via the , we also support more fine-grained filtering and the offloading of data, i.e. data will be pushed from our data centers to your data storage systems.
We currently support OpenSearch, Elasticsearch and S3 as target storage systems.
Changes#
2024-03-10: release of first data example