OWI Access#
TL;DR
The OWI is delivered as daily index shards
OWI shards are stored in a federated iRODS infrastructure via the LEXIS platform
Metadata per dataset allow better findability and selection of datasets
OWI shards are structured in a date/language partitioned folder structure, with metadata in Parquet files and the index in CIFF files (Common Index File Format)
Several access possibilities exist: direct download via LEXIS, scripting via py4lexis, or the OWI CLI tool owilix
As of September 2024, the OWI technology stack has been successfully deployed across multiple data centers, including BADW-LRZ, IT4I@VSB, and CSC. The pipeline processes daily data to generate index shards — segments of the complete index created from daily processed data. These index shards are crucial outputs published as LEXIS Datasets on a federated iRODS infrastructure, which unifies data access across centers. To streamline operations, the preprocessing and indexing components of the workflow run within a single HPC workflow on the LEXIS platform.
List of Daily Index Shards and Dashboard#
You can find the list of daily index shards either in the LEXIS Portal after logging in or at the OpenWebIndex.eu Dashboard.
The Dashboard offers further statistics, such as the amount of crawled data and the number of published datasets.
Structure of Daily Index Shards#
The final product of the full pipeline described above, the Open Web Index, consists of all public datasets produced by the federated infrastructure and published through LEXIS. The Open Web Index is provided as daily shards per data center in the form of so-called datasets. We aim for a maximum delay of one day between crawling the data and preprocessing/indexing it.
A dataset follows a particular folder structure (see Folder Structures below) and has associated metadata containing elements of the DataCite vocabulary and application-specific metadata, particularly start date, end date, and collection name. The changelog.json file contains potential changes to an index partition, such as items removed due to take-down requests. Each dataset contains CIFF and Parquet files, partitioned by language. Parquet files contain the metadata, while CIFF files contain the usable index, currently an inverted index.
Access control to LEXIS (and thus, the OWI) is arranged through EUDAT/B2ACCESS. To use the Open Web Index, downstream search engines would download the CIFF files they are interested in (e.g. for a specific language or from a specific date range) and import them into a search engine of their choice. Alongside the CIFF files, the metadata Parquet files can be downloaded for additional use in a search engine. For instance, the cleaned text can be used for snippet extraction, and the metadata fields can be used to enrich or filter the search results obtained by a full-text search on the index.
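The enrichment step mentioned above can be sketched in plain Python. This is only an illustration, not part of the OWI tooling: the function names, the in-memory dictionary of metadata rows, and the exact content of the url_domain value are assumptions. In practice the rows would be read from the metadata Parquet files and the hit IDs would come from the search engine that imported the CIFF index.

```python
def make_snippet(plain_text: str, query: str, width: int = 40) -> str:
    """Return a window of text around the first occurrence of the query."""
    pos = plain_text.lower().find(query.lower())
    if pos < 0:
        # Query not found: fall back to the beginning of the text.
        return plain_text[: 2 * width]
    start = max(0, pos - width)
    end = min(len(plain_text), pos + len(query) + width)
    return plain_text[start:end]


def enrich_hits(hit_ids, metadata_by_id, query, domain=None):
    """Attach title, URL and snippet to each hit; optionally filter by url_domain."""
    results = []
    for doc_id in hit_ids:
        row = metadata_by_id.get(doc_id)
        if row is None:
            continue  # no metadata available for this hit
        if domain is not None and row["url_domain"] != domain:
            continue
        results.append({
            "id": doc_id,
            "title": row["title"],
            "url": row["url"],
            "snippet": make_snippet(row["plain_text"], query),
        })
    return results


# Hypothetical metadata rows using column names from the Parquet schema.
rows = {
    "a": {"title": "Open Web Index", "url": "https://example.org/owi",
          "url_domain": "example",
          "plain_text": "The Open Web Index is delivered as daily index shards."},
    "b": {"title": "Other page", "url": "https://other.net/",
          "url_domain": "other",
          "plain_text": "An unrelated page that also mentions an index."},
}
hits = enrich_hits(["a", "b", "missing"], rows, "index", domain="example")
print(hits)
```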
Folder Structures#
Daily shards are stored on a remote iRODS server, which can be thought of as a remote file system.
Folder Structure for Daily Shards#
As described above, we create a daily dataset for the OWI and for every configured OWI-CI.
/ZONE/public<projectid>/<uuid>          # zone and project id
  year={YYYY}                           # year of the slice
    month={MM}                          # month of the slice
      day={DD}                          # day of the slice
        language={LANG}                 # language partitions
          index.ciff.gz                 # ciff file containing the index
          metadata-{num?}.parquet       # parquet file containing the metadata of the index
Zones usually refer to the data center the data is stored in. Currently we support two zones:
IT4ILexisV2 at the IT4I data center
OWSLRZZONE at the LRZ data center
Language is a three-letter language code (ISO 639-3, as used in the language metadata column).
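As a rough illustration of the layout above, the following sketch assembles the path of one language partition. The function name is hypothetical, and the project id and UUID are left as the placeholders from the template; only the partition scheme itself is taken from the structure shown above.

```python
from datetime import date
from pathlib import PurePosixPath


def shard_partition_path(zone: str, project_id: str, dataset_uuid: str,
                         day: date, language: str) -> PurePosixPath:
    """Build the path of one language partition inside a daily shard."""
    return PurePosixPath(
        f"/{zone}",
        f"public{project_id}",
        dataset_uuid,
        f"year={day.year:04d}",
        f"month={day.month:02d}",
        f"day={day.day:02d}",
        f"language={language}",
    )


# Placeholders stay placeholders: real project ids and UUIDs come from LEXIS.
partition = shard_partition_path("IT4ILexisV2", "<projectid>", "<uuid>",
                                 date(2024, 9, 1), "eng")
index_file = partition / "index.ciff.gz"
print(index_file)
```

Using PurePosixPath keeps the sketch independent of the local operating system, since the paths refer to a remote collection rather than to local files.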
Folder Structure for WARC Data#
WARC data is stored in a folder structure similar to that of the index shards.
/ZONE/proj<projectid>/<uuid>            # zone and project id
  year={YYYY}                           # year of the slice
    month={MM}                          # month of the slice
      day={DD}                          # day of the slice
        crawler={_crawler_name_}/       # name of the crawler
          HHmmss-{int}.warc.gz          # single WARC file; naming scheme subject to change
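Because both folder structures encode their partitions as key=value folder names, the partition values can be recovered from a path with a few lines of Python. The helper below is a minimal sketch; the function name, the crawler name and the file name in the example path are made up for illustration.

```python
from pathlib import PurePosixPath


def parse_partitions(path: str) -> dict[str, str]:
    """Collect the key=value folder names of a path into a dictionary."""
    partitions = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        if sep:  # only path segments of the form key=value are partitions
            partitions[key] = value
    return partitions


# Hypothetical WARC path following the structure above.
example = ("/ZONE/proj<projectid>/<uuid>/year=2024/month=09/day=01/"
           "crawler=owler/120000-1.warc.gz")
info = parse_partitions(example)
print(info)
```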
Available Data Fields#
Data and metadata for individual web pages are available in the .parquet files in a row/column structure.
Parquet files are created during preprocessing via Resilipipe and contain the following fields:
Schema Version 0.1.0
Column | Description | Pyspark Datatype |
---|---|---|
id | Unique ID based on hash of the URL and crawling time | |
record_id | UUID of the WARC record | |
title | Title from the HTML | |
plain_text | Cleaned text from the HTML | |
json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | |
microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | |
warc_date | Date from the WARC header | |
warc_ip | IP Address from the WARC header | |
url | Full URL | |
url_scheme | URL scheme specifier | |
url_path | Hierarchical path after TLD | |
url_params | Parameters for last path element | |
url_query | Query component | |
url_fragment | Fragment identifier | |
url_subdomain | Subdomain of the network location | |
url_domain | Domain of the network location | |
url_suffix | Suffix according to the Public Suffix List | |
url_is_private | If the URL has a private suffix | |
mime_type | MIME-Type from the HTTP Header | |
charset | charset from the HTTP Header | |
content_type_other | List of key, value pairs from the content type that could not be parsed into MIME-type or charset | |
http_server | Server from the HTTP Header | |
language | Language as identified by language.py; Code according to ISO-639 Part 3 | |
valid | | |
warc_file | Name of the original WARC-file that contained record | |
ows_canonical | The canonical link if it exists | |
ows_resource_type | Crawl from which the WARC-file originated; Files crawled by the University of Passau are labeled with “Owler” | |
ows_curlielabel | One of the 15 Curlie top level labels | |
ows_index | | |
ows_genai | | |
ows_genai_details | If | |
ows_fetch_response_time | Fetch time in ms | |
ows_fetch_num_errors | Number of errors while fetching (Timeout is the most prominent fetch error) | |
schema_metadata | List of key, value pairs that contain global settings like the | |
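To make the url_* columns more concrete, the sketch below splits a URL with Python's standard urllib.parse module. It is an illustration of what the columns mean, not the Resilipipe implementation, which may differ in detail. The url_subdomain, url_domain and url_suffix columns are omitted because they require the Public Suffix List, which the standard library does not provide.

```python
from urllib.parse import urlparse


def url_columns(url: str) -> dict[str, str]:
    """Split a URL into the components that mirror the url_* columns."""
    parsed = urlparse(url)
    return {
        "url": url,
        "url_scheme": parsed.scheme,
        "url_path": parsed.path,
        "url_params": parsed.params,      # parameters of the last path element
        "url_query": parsed.query,
        "url_fragment": parsed.fragment,
    }


cols = url_columns("https://www.example.org/docs/page;v=2?lang=en#intro")
for name, value in cols.items():
    print(f"{name}: {value}")
```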
Additional columns can be added by providing modules as outlined in the respective README.
One module is outgoing links detection.
Column | Description | Pyspark Datatype |
---|---|---|
outgoing_links | List of all hyperlinks in the HTML that start with ‘http’ | |
image_links | List of all links to images in the HTML that start with ‘http’ | See |
video_links | List of all links to videos in the HTML that start with ‘http’ or iframes with a video | See |
iframes | List of tuples for nodes that contain an iframe (and are not a video) | See |
curlielabels | List of language specific domain labels according to Curlie.org | |
curlielabels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano | |
address | List of dictionaries containing extracted location and coordinates | See |
collection_indices | List of collection indices that a record belongs to. Are defined via | ArrayType(StringType()) |
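The outgoing_links column can be pictured with a small standard-library sketch that collects all hyperlinks starting with ‘http’ from an HTML string. The class name and sample HTML are hypothetical, and the actual Resilipipe module is likely implemented differently; the sketch only mirrors the rule stated in the table.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with 'http'."""

    def __init__(self):
        super().__init__()
        self.outgoing_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Relative links and other schemes (mailto:, javascript:) are skipped.
            if name == "href" and value and value.startswith("http"):
                self.outgoing_links.append(value)


html = ('<p><a href="https://example.org/a">A</a> '
        '<a href="/local">B</a> '
        '<a href="http://example.com">C</a></p>')
collector = LinkCollector()
collector.feed(html)
print(collector.outgoing_links)
```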
Contribute
Resilipipe is modular, and you can add your own modules. Be aware, however, that they need to scale properly!
We are currently working on geo-coding [FH24], trigger warnings [WWSchroder+23], and genre detection [SE08] as additional metadata.
Metadata Used for Describing Datasets#
iRODS allows metadata to be attached to every folder; this metadata is used to serve datasets via the LEXIS Portal and to make suitable datasets easier to find.
We follow basic Dublin Core metadata and extend it with application-specific metadata.
Metadata Fields Version 0.2.0 (currently in use)
Attribute | Value | Type |
---|---|---|
creator | OpenWebSearch.eu Consortium | |
contributor | A1 Slovenija | |
contributor | Webis Group | |
contributor | CERN - The European Organization for Nuclear Research | |
contributor | CSC - IT Center for Science Ltd | |
contributor | German Aerospace Center (DLR) | |
contributor | Graz University of Technology | |
contributor | Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities | |
contributor | Open Search Foundation | |
contributor | Stichting Radboud University | |
contributor | University of Passau | |
contributor | VSB - TECHNICAL UNIVERSITY OF OSTRAVA | |
relatedSoftware | | |
relatedSoftware | | |
alternateIdentifier | abc | |
startDate | “2023-01-01” | |
endDate | “2023-01-01” | |
lastChanged | “2023-01-01 00:00:00” | |
owner | OpenWebSearch.eu Consortium | |
publicationYear | 2024 | |
publisher | OpenWebSearch.eu Consortium | |
resourceType | “owi” (options: “owi”, “warc”, “owie”, “owii”, “owip”, “unknown”) | |
subResourceType | “ciff+parquet” | |
rights | “Open Web Index License V1.0” | |
rightsIdentifier | “OWIL V1.0” | |
rightsURI | https://ows.eu/owil/current | |
publication | https://doi.org/10.1002/asi.24818 | |
provenance | | |
license | https://ows.eu/owil/current | |
dataCenter | “unknown” (options: “lrz”, “it4i”, “csc”, “unknown”) | |
collectionName | “main” | |
description | “All OWI data indexed from {startDate} to (including) {endDate} at {dataCenter}” | |
resourceTypeGeneral | “Dataset” (options: “Dataset”) | |
encryption | “no” (options: “no”, “yes”) | |
compression | “no” (options: “no”, “yes”) | |
totalSize | 0 | |
fileCount | 0 | |
objectCount | 0 | |
title | “OWI-Open Web Index-{collectionName}.{resourceType}@{dataCenter}-{startDate}:{endDate}” | |
metadataSource | Software+Version | |
Details on some Metadata Fields#
Title
The title consists of the agreed index name OWI-Open Web Index and the collection name: either main or a named sub-collection, e.g. curlie or legal.
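The title and description templates from the metadata table can be filled with ordinary string formatting. The following is a minimal sketch with example values; the variable names are illustrative, and whether the production tooling builds the strings this way is an assumption.

```python
# Templates copied from the metadata table (Metadata Fields Version 0.2.0).
TITLE_TEMPLATE = ("OWI-Open Web Index-{collectionName}.{resourceType}"
                  "@{dataCenter}-{startDate}:{endDate}")
DESCRIPTION_TEMPLATE = ("All OWI data indexed from {startDate} to (including) "
                        "{endDate} at {dataCenter}")

# Example values; real values are set per dataset.
metadata = {
    "collectionName": "main",
    "resourceType": "owi",
    "dataCenter": "lrz",
    "startDate": "2023-01-01",
    "endDate": "2023-01-01",
}

title = TITLE_TEMPLATE.format(**metadata)
description = DESCRIPTION_TEMPLATE.format(**metadata)
print(title)
print(description)
```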
Creator, Owner, Publisher
Creator and contributor have been swapped: we should have a single creator, the consortium, and contributions by all partners. This is also a matter of how the fields are displayed in the UI.
Collection Name
Index shards can be pre-filtered for a particular purpose; the resulting subsets are called collections. Currently we support
the main collection, forming the main index
the legal collection, containing all contact, GDPR, and legal notice pages found during our crawls.
Further collections can be added in the future.
resourceType
Determines the kind of dataset, which in turn determines the files it contains and the applications it supports. The resource type takes one of the following values:
owi: contains parquet + ciff files (i.e. metadata plus index)
owii: contains only ciff files, but no metadata (not yet available)
owip: contains only parquet files with metadata
owie: contains vector-embeddings (not yet available)
warc: contains raw crawl data in warc format
The statistics below count the number of URLs in the parquet files as well as the compressed file size and file count over time for every resource/collection pair.
subResourceType
Determines, on a more fine-grained level, the file types available in the index, i.e.
ciff: indicates that CIFF files are present
parquet: indicates the availability of Parquet files
emb: contains embeddings
<algorithm>: contains the algorithm
Note that subResourceTypes can be combined, e.g. “ciff+parquet”
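A combined subResourceType value can be split on the plus sign and compared with what the resourceType suggests. The sketch below is hedged accordingly: the function names and the EXPECTED_PARTS mapping are assumptions derived from the resourceType list above, not a documented validation rule.

```python
# File kinds implied by each resourceType, read off the list above.
# owie and warc are left out because their subResourceType values
# are not specified in this section.
EXPECTED_PARTS = {
    "owi": {"ciff", "parquet"},
    "owii": {"ciff"},
    "owip": {"parquet"},
}


def parse_sub_resource_type(value: str) -> set[str]:
    """Split a combined value such as 'ciff+parquet' into its parts."""
    return {part.strip() for part in value.split("+") if part.strip()}


def is_consistent(resource_type: str, sub_resource_type: str) -> bool:
    """Check that the parts match what the resource type implies."""
    expected = EXPECTED_PARTS.get(resource_type)
    if expected is None:
        return True  # nothing known to check against
    return parse_sub_resource_type(sub_resource_type) == expected


parts = parse_sub_resource_type("ciff+parquet")
print(sorted(parts), is_consistent("owi", "ciff+parquet"))
```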
DataCenter
Field indicating the data center the data set is stored in.
Provenance
Space-separated list of sources (as URIs) from which the dataset has been derived. Use owi://uuid to indicate OWI datasets.
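Reading the provenance field then amounts to splitting on whitespace and picking out the owi:// entries. This is a minimal sketch; the function name and the identifiers in the example value are invented for illustration.

```python
def owi_source_uuids(provenance: str) -> list[str]:
    """Return the UUID part of every owi:// URI in a provenance value."""
    prefix = "owi://"
    return [uri[len(prefix):]
            for uri in provenance.split()
            if uri.startswith(prefix)]


# Hypothetical provenance value mixing OWI datasets and another source.
provenance = "owi://1111-aaaa https://example.org/source owi://2222-bbbb"
sources = owi_source_uuids(provenance)
print(sources)
```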