OWI Access#

TL;DR

  • The OWI is delivered as daily index shards

  • OWI shards are stored in a federated iRODS infrastructure via the LEXIS platform

    • Metadata per dataset allow better findability and selection of datasets

  • OWI shards are structured in a date/language partitioned folder structure, with

    • metadata in parquet files and the

  index in CIFF files (Common Index File Format)

  • Several access possibilities exist: direct download via LEXIS, scripting via py4lexis, or the OWI CLI tool owilix

As of September 2024, the OWI technology stack has been successfully deployed across multiple data centers, including BADW-LRZ, IT4I@VSB, and CSC. The pipeline processes daily data to generate index shards — segments of the complete index created from daily processed data. These index shards are crucial outputs published as LEXIS Datasets on a federated iRODS infrastructure, which unifies data access across centers. To streamline operations, the preprocessing and indexing components of the workflow run within a single HPC workflow on the LEXIS platform.

List of Daily Index Shards and Dashboard#

You can find the list of daily index shards either in the LEXIS Portal after logging in or at the OpenWebIndex.eu Dashboard.

The Dashboard offers further statistics, such as the amount of crawled data and the number of published datasets.

Structure of Daily Index Shards#

The final product of the full pipeline described above, the Open Web Index, consists of all public datasets produced by the federated infrastructure and published through LEXIS. The Open Web Index is provided as daily shards per data center in the form of so-called datasets. We aim for a maximum delay of one day from crawling the data to preprocessing and indexing.

A dataset follows a particular folder structure, as depicted in the image on the right, and has associated metadata containing elements of the DataCite vocabulary as well as application-specific metadata, particularly start date, end date, and collection name. The changelog.json file records potential changes to an index partition, such as items removed due to takedown requests. Each dataset contains CIFF and Parquet files, partitioned by language. Parquet files contain the metadata, while CIFF files contain the usable index, currently an inverted index.

Access control to LEXIS (and thus, the OWI) is arranged through EUDAT / B2ACCESS . To use the Open Web Index, downstream search engines would download the CIFF files they are interested in (e.g. with a specific language or from a specific date range), and import them into a search engine of their choice. Alongside the CIFF files, the metadata Parquet files can be downloaded for additional use in a search engine. For instance, the cleaned text can be used for snippet extraction, and the metadata fields can be used to enrich or filter the search results obtained by a full-text search on the index.

Folder Structures#

Daily shards are stored on a remote iRODS server, which can be thought of as a remote filesystem.

Folder Structure for Daily Shards#

As described above, a daily dataset is created for the OWI and for every configured OWI-CI.

/ZONE/public<projectid>/<uuid>               # zone and project id
    year={YYYY}                              # year of the slice
        month={MM}                           # month of the slice
            day={DD}                         # day of the slice
                language={LANG}              # language partitions
                    index.ciff.gz            # ciff file containing the index
                    metadata-{num?}.parquet  # parquet file containing the metadata of the index 

Zones usually refer to the data center the data is stored in. Currently we support two zones:

  • IT4ILexisV2 at the IT4I data center

  • OWSLRZZONE at the LRZ data center

Language is a three-letter language code according to the ISO-639 Part 3 standard.
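The partition layout above can be turned into concrete shard paths programmatically. A minimal sketch in Python, where the zone, project ID, and UUID values are illustrative placeholders (real values come from the LEXIS dataset listing):

```python
from datetime import date, timedelta

def shard_paths(zone, project, uuid, start, end, lang):
    """Yield the iRODS path of the daily index CIFF shard for each day
    in [start, end], following the date/language partitioning above."""
    day = start
    while day <= end:
        yield (f"/{zone}/public{project}/{uuid}/"
               f"year={day:%Y}/month={day:%m}/day={day:%d}/"
               f"language={lang}/index.ciff.gz")
        day += timedelta(days=1)

# Illustrative values; the real zone/project/UUID come from the LEXIS listing.
paths = list(shard_paths("OWSLRZZONE", "123", "abc-uuid",
                         date(2024, 9, 1), date(2024, 9, 3), "eng"))
```

Iterating a date range this way yields one CIFF path per day; the same pattern with `metadata-*.parquet` selects the metadata files instead.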

Folder Structure for WARC Data#

WARC data is stored in a folder structure similar to that of the index shards.

/ZONE/proj<projectid>/<uuid>                 # zone and project id
    year={YYYY}                              # year of the slice
        month={MM}                           # month of the slice
            day={DD}                         # day of the slice
               crawler={_crawler_name_}/     # Name of the crawler 
                  HHmmss-{int}.warc.gz       # single WARC file; naming scheme subject to change

Available Data Fields#

Data and metadata for individual web pages are available in the .parquet files in a row/column structure.

Parquet files are created during preprocessing via Resilipipe and contain the following fields:

Schema Version 0.1.0

| Column | Description | Pyspark Datatype |
|---|---|---|
| id | Unique ID based on hash of the URL and crawling time | StringType() |
| record_id | UUID of the WARC record | StringType() |
| title | Title from the HTML | StringType() |
| plain_text | Cleaned text from the HTML | StringType() |
| json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | StringType() |
| microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | StringType() |
| warc_date | Date from the WARC header | StringType() |
| warc_ip | IP address from the WARC header | StringType() |
| url | Full URL | StringType() |
| url_scheme | URL scheme specifier | StringType() |
| url_path | Hierarchical path after TLD | StringType() |
| url_params | Parameters for last path element | StringType() |
| url_query | Query component | StringType() |
| url_fragment | Fragment identifier | StringType() |
| url_subdomain | Subdomain of the network location | StringType() |
| url_domain | Domain of the network location | StringType() |
| url_suffix | Suffix according to the Public Suffix List | StringType() |
| url_is_private | Whether the URL has a private suffix | BooleanType() |
| mime_type | MIME type from the HTTP header | StringType() |
| charset | Charset from the HTTP header | StringType() |
| content_type_other | List of key/value pairs from the content type that could not be parsed into MIME type or charset | MapType(StringType(), StringType()) |
| http_server | Server from the HTTP header | StringType() |
| language | Language as identified by language.py; code according to ISO-639 Part 3 | StringType() |
| valid | True: the record is valid; False: the record is no longer valid and should not be processed | BooleanType() |
| warc_file | Name of the original WARC file that contained the record | StringType() |
| ows_canonical | The canonical link, if it exists | StringType() |
| ows_resource_type | Crawl from which the WARC file originated; files crawled by the University of Passau are labeled with “Owler” | StringType() |
| ows_curlielabel | One of the 15 Curlie top-level labels | StringType() |
| ows_index | True: the content is allowed to be used for the purposes of web indexing/web search; False: the content cannot be used | BooleanType() |
| ows_genai | True: the content is allowed to be used for the purposes of developing generative AI models; False: the content cannot be used | BooleanType() |
| ows_genai_details | If ows_genai=False, this provides additional context | StringType() |
| ows_fetch_response_time | Fetch time in ms | IntegerType() |
| ows_fetch_num_errors | Number of errors while fetching (timeout is the most prominent fetch error) | StringType() |
| schema_metadata | List of key/value pairs that contain global settings like the schema_version | MapType(StringType(), StringType()) |
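The valid, ows_index, and ows_genai columns gate what a record may be used for downstream. A minimal filtering sketch in Python, operating on records already loaded from a metadata Parquet file into plain dicts (the loading step itself, e.g. via pyarrow or Spark, is omitted; the sample records are illustrative):

```python
def usable_for_indexing(record):
    # A record may feed a search index only if it is still valid
    # and carries the ows_index permission flag.
    return bool(record.get("valid")) and bool(record.get("ows_index"))

def usable_for_genai(record):
    # Generative-AI use is governed by the separate ows_genai flag.
    return bool(record.get("valid")) and bool(record.get("ows_genai"))

records = [
    {"id": "a", "valid": True,  "ows_index": True,  "ows_genai": False},
    {"id": "b", "valid": False, "ows_index": True,  "ows_genai": True},
    {"id": "c", "valid": True,  "ows_index": True,  "ows_genai": True},
]
indexable = [r["id"] for r in records if usable_for_indexing(r)]
genai_ok = [r["id"] for r in records if usable_for_genai(r)]
```

Treating the two permission flags independently mirrors the schema: a page may permit indexing while forbidding generative-AI use.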

Additional columns can be added by providing modules as outlined in the respective README.

One such module is outgoing link detection.

| Column | Description | Pyspark Datatype |
|---|---|---|
| outgoing_links | List of all hyperlinks in the HTML that start with ‘http’ | ArrayType(StringType()) |
| image_links | List of all links to images in the HTML that start with ‘http’ | See get_spark_schema in links.py |
| video_links | List of all links to videos in the HTML that start with ‘http’, or iframes with a video | See get_spark_schema in links.py |
| iframes | List of tuples for nodes that contain an iframe (and are not a video) | See get_spark_schema in links.py |
| curlielabels | List of language-specific domain labels according to Curlie.org | ArrayType(StringType()) |
| curlielabels_en | List of English domain labels according to Curlie.org; mapping by Lugeon, Sylvain; Piccardi, Tiziano | ArrayType(StringType()) |
| address | List of dictionaries containing extracted location and coordinates | See get_spark_schema in geoparsing.py |
| collection_indices | List of collection indices that a record belongs to; defined via YAML files on the S3 instance | ArrayType(StringType()) |

Contribute

  • Resilipipe is modular and you can add your own modules. But be aware that they need to scale properly!

  • We are currently working on geo-coding [FH24], trigger warnings [WWSchroder+23], and genre detection [SE08] as additional metadata

Used Metadata for Describing Datasets#

iRODS allows attaching metadata to every folder, which is then used to serve datasets via the LEXIS Portal and to ease searching for suitable datasets.

We follow basic Dublin Core metadata and extend it with application-specific metadata.

Metadata Fields Version 0.2.0 (currently in use)

| Attribute | Value | Type |
|---|---|---|
| creator | OpenWebSearch.eu Consortium | str |
| contributor | A1 Slovenija | list[str] |
| contributor | Webis Group | list[str] |
| contributor | CERN - The European Organization for Nuclear Research | list[str] |
| contributor | CSC - IT Center for Science Ltd | list[str] |
| contributor | German Aerospace Center (DLR) | list[str] |
| contributor | Graz University of Technology | list[str] |
| contributor | Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities | list[str] |
| contributor | Open Search Foundation | list[str] |
| contributor | Stichting Radboud University | list[str] |
| contributor | University of Passau | list[str] |
| contributor | VSB - TECHNICAL UNIVERSITY OF OSTRAVA | list[str] |
| relatedSoftware | Resilipipe | list[str] |
| relatedSoftware | OWI Indexer | list[str] |
| alternateIdentifier | abc | list[str] |
| startDate | “2023-01-01” | str |
| endDate | “2023-01-01” | str |
| lastChanged | “2023-01-01 00:00:00” | str |
| owner | OpenWebSearch.eu Consortium | str |
| publicationYear | 2024 | int |
| publisher | OpenWebSearch.eu Consortium | str |
| resourceType | “owi” (options: “owi”, “warc”, “owie”, “owii”, “owip”, “unknown”) | str |
| subResourceType | “ciff+parquet” | str |
| rights | “Open Web Index License V1.0” | list[str] |
| rightsIdentifier | “OWIL V1.0” | str |
| rightsURI | https://ows.eu/owil/current | list[str] |
| publication | https://doi.org/10.1002/asi.24818 | str |
| provenance | owi://mgrani@import | list[str] |
| license | https://ows.eu/owil/current | str |
| dataCenter | “unknown” (options: “lrz”, “it4i”, “csc”, “unknown”) | str |
| collectionName | “main” | str |
| description | “All OWI data indexed from {startDate} to (including) {endDate} at {dataCenter}” | str |
| resourceTypeGeneral | “Dataset” (options: “Dataset”) | str |
| encryption | “no” (options: “no”, “yes”) | str |
| compression | “no” (options: “no”, “yes”) | str |
| totalSize | 0 | int |
| fileCount | 0 | int |
| objectCount | 0 | int |
| title | “OWI-Open Web Index-{collectionName}.{resourceType}@{dataCenter}-{startDate}:{endDate}” | str |
| metadataSource | Software+Version | str |

Details on some Metadata Fields#

Title

The title consists of the agreed index name OWI-Open Web Index and the collection name: either main or a named sub-collection, e.g. curlie or legal.
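The title template listed in the metadata table can be assembled mechanically from the other fields. A small sketch, with illustrative field values:

```python
def dataset_title(meta):
    # Follows the template given for the 'title' attribute above.
    return ("OWI-Open Web Index-{collectionName}.{resourceType}"
            "@{dataCenter}-{startDate}:{endDate}").format(**meta)

title = dataset_title({
    "collectionName": "main",
    "resourceType": "owi",
    "dataCenter": "lrz",
    "startDate": "2023-01-01",
    "endDate": "2023-01-01",
})
```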

Creator, Owner, Publisher

Creator and contributor have been swapped: there should be a single creator, the consortium, with contributions by all partners. This is also a UI consideration.

Collection Name

Index shards can be pre-filtered for a particular purpose; such pre-filtered shards are called collections. Currently we support

  1. the main collection forming the main index

  2. the legal collection, containing all contact, GDPR, and legal-notice pages found during our crawls.

Further collections can be added in the future.

resourceType

Determines the kind of dataset, which in turn determines the files it contains and the applications it supports. This is indicated in the resource type:

  • owi: contains parquet + ciff files (i.e. metadata plus index)

    • owii: contains only ciff files, but no metadata (not yet available)

    • owip: contains only parquet files with metadata

  • owie: contains vector-embeddings (not yet available)

  • warc: contains raw crawl data in warc format

The statistics below count the number of URLs in the parquet files as well as the compressed file size and file count over time for every resource/collection pair.
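The resource types above map directly to the kinds of files a dataset ships with. A hypothetical lookup table sketching that mapping (the set labels are illustrative, mirroring the list above; unknown types fall back to an empty set):

```python
RESOURCE_TYPE_FILES = {
    "owi":  {"ciff", "parquet"},  # metadata plus index
    "owii": {"ciff"},             # index only (not yet available)
    "owip": {"parquet"},          # metadata only
    "owie": {"embeddings"},       # vector embeddings (not yet available)
    "warc": {"warc"},             # raw crawl data
}

def expected_files(resource_type):
    """Return the file kinds expected for a given resourceType value."""
    return RESOURCE_TYPE_FILES.get(resource_type, set())
```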

subResourceType

Determines, at a more fine-grained level, the file types available in the index, i.e.

  • ciff: indicates that CIFF files are present

  • parquet: indicates that Parquet files are present

  • emb: contains embeddings

  • <algorithm>: names the algorithm used

Note that subResourceTypes can be combined, e.g. “ciff+parquet”.
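Combined values decompose trivially on the “+” separator; a sketch:

```python
def sub_resource_types(value):
    # "ciff+parquet" -> {"ciff", "parquet"}; empty input -> empty set.
    return set(value.split("+")) if value else set()

kinds = sub_resource_types("ciff+parquet")
```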

DataCenter

Field indicating the data center the data set is stored in.

Provenance

Space-separated list of sources (as URIs) from which the dataset has been derived. Use owi://uuid to indicate OWI datasets.