Preprocessing and Semantic Enrichment

Preprocessing and content analysis encompass the following two cluster tiers:

  • The Preprocessing and Enrichment Tier (PET) takes the WARC files from the S3 Bucket filled by the CCT and extracts cleaned HTML and metadata. After the first year, the metadata consists of the page’s language and several properties derived from the URL. Eventually, more metadata — like topic, geo-information and entities — will also be extracted. Following the partitioning of the CCT, each data center should have a dedicated PET for the WARC files stored at that data center. The metadata extracted by the PET is stored in Parquet format.

  • The Preprocessing Plugins Evaluation Tier (PPET) is another singleton tier that enables the evaluation of plugins for the content analysis library. To expand enrichment capabilities, both project members and third parties may develop plugins to be used in the PET.

    To ensure good quality and sufficient throughput, candidate plugins first have to perform their enrichment task on problem-specific benchmarking data (e.g., a classification dataset for a plugin performing webpage classification), using an instance of the TIRA platform ([FrobeWK+23]) hosted at one of the data centers. This platform provides evaluation as a service with a focus on information retrieval research. It can host shared tasks on a given research problem and run submitted software in virtual machines, thereby producing reproducible results.
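The URL-derived properties that the PET extracts alongside the language can be sketched with Python's standard urllib.parse. Note that the field names and the helper below are illustrative assumptions modeled on the schema shown later in this page; the actual extraction logic lives in Resilipipe.

```python
from urllib.parse import urlsplit

def url_properties(url: str) -> dict:
    """Derive simple metadata fields from a URL (illustrative only;
    the actual PET uses Resilipipe's own extraction logic)."""
    parts = urlsplit(url)
    return {
        "url_scheme": parts.scheme,
        "url_domain": parts.hostname or "",
        "url_path": parts.path,
        "url_query": parts.query,
        "url_fragment": parts.fragment,
    }

props = url_properties("https://info.cern.ch/hypertext/WWW/TheProject.html")
```

Fields such as url_subdomain and url_suffix additionally require a public-suffix list (e.g., via a library like tldextract) and are omitted from this sketch.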

Resilipipe

Resilipipe is an open-source software framework that implements a scalable, cluster-based web content analysis pipeline for web archive data, built on Resiliparse. It can be run on HPC clusters using Apache Spark and Magpie and is part of our workflows. Resilipipe takes the WARC files produced by crawling and processes them in parallel on a Spark cluster. This processing consists of parsing the WARC records, extracting cleaned HTML and relevant metadata, and saving the results as Parquet files.
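The per-record shape of that processing can be sketched as a function that turns one record's HTML payload into a metadata row. The minimal extractor below uses only the standard library as a stand-in; the real pipeline uses Resiliparse for parsing and runs such a function over WARC records on Spark.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Minimal HTML title/text extractor standing in for Resiliparse."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())

def process_record(html: str, url: str) -> dict:
    """Turn one WARC response record's payload into a metadata row."""
    extractor = _TextExtractor()
    extractor.feed(html)
    return {
        "url": url,
        "title": extractor.title,
        "plain_text": " ".join(extractor.text_parts),
    }

row = process_record(
    "<html><head><title>The World Wide Web project</title></head>"
    "<body><p>Hypermedia information retrieval.</p></body></html>",
    "https://info.cern.ch/hypertext/WWW/TheProject.html",
)
```

In the actual pipeline, a Spark job would map such a function over the records yielded by a WARC iterator and write the resulting rows out as Parquet.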

Parquet-based storage of metadata and preprocessing results

Results from preprocessing are stored as Parquet files. Parquet is a column-oriented file format for efficient data storage and retrieval: conceptually a table, but laid out so that individual columns can be read efficiently without scanning entire rows.
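To make the column orientation concrete, the toy comparison below contrasts row-wise and column-wise storage in plain Python; real Parquet readers (e.g., pyarrow) additionally compress, chunk, and index the column data.

```python
# Row-oriented: each record is stored together; reading one field
# still touches every record in full.
rows = [
    {"url": "https://info.cern.ch/", "language": "eng", "valid": True},
    {"url": "https://example.org/", "language": "deu", "valid": False},
]

# Column-oriented (Parquet-like): each field is stored contiguously,
# so a query over one column never materializes the others.
columns = {
    "url": ["https://info.cern.ch/", "https://example.org/"],
    "language": ["eng", "deu"],
    "valid": [True, False],
}

# Selecting a single column is a direct lookup in columnar form...
languages = columns["language"]
# ...but requires visiting every row in row-oriented form.
languages_from_rows = [r["language"] for r in rows]
```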

Below you can find an example row of the Parquet file for the first webpage, using schema version 0.1.0. For more details on the extracted fields, please refer to the README of Resilipipe.

id: f5ebe97d74c3a1866a1298e178c27c5e44e822ff18720e284ab4f74145837399
record_id: 4f475dda-643a-46b3-ae45-a5598d3229a9
title: The World Wide Web project
plain_text: The World Wide Web project World Wide Web The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents. […] Getting the code by anonymous FTP , etc.
json-ld:
microdata:
warc_date: 2024-12-05T08:38:42Z
warc_ip: 2001:1458:201:a4::100:1a0
url: <https://info.cern.ch/hypertext/WWW/TheProject.html>
url_scheme:
url_path: <https://info.cern.ch/hypertext/WWW/TheProject.html>
url_params:
url_query:
url_fragment:
url_subdomain:
url_domain:
url_suffix:
url_is_private: False
mime_type: text/html
charset:
content_type_other:
http_server: Apache
language: eng
valid: True
warc_file: ../data/www_project.warc.gz
ows_canonical:
ows_resource_type:
ows_curlielabel:
ows_index:
ows_genai:
ows_genai_details:
ows_fetch_response_time:
ows_fetch_num_errors:
schema_metadata: [('schema_version', '0.1.0')]
outgoing_links:
image_links:
video_links:
iframes:
curlielabels:
curlielabels_en:
address:
collection_indices:

Modular Design of Resilipipe

Each record in a WARC file is processed sequentially in a modular fashion as illustrated below.

(Figure: the modular processing steps of Resilipipe)

Building on the results from the standard processing steps, more extensive parsing is handled by parsing modules (“plugins”). One example of such a module is the Geoparsing and -tagging module developed by DLR. An extensive description can be found in the corresponding README of Resilipipe. To allow for adaptations to specific use cases, modules can be added to and removed from Resilipipe’s processing steps.

Follow these instructions to develop your own modules.
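One plausible shape for such a module is a class exposing a process method that enriches the metadata row, with a pipeline applying registered modules in order. Note that the interface, the class name, and the output field below are hypothetical placeholders for illustration, not Resilipipe's actual API; see its README for the real contract.

```python
class LinkCounterModule:
    """Hypothetical parsing module: counts outgoing links in a row.
    The interface (a `process` method that enriches the row dict)
    is an assumption for illustration, not Resilipipe's actual API."""
    name = "link_counter"

    def process(self, row: dict) -> dict:
        # Add a derived field without touching the existing ones.
        row[f"{self.name}_num_links"] = len(row.get("outgoing_links", []))
        return row

def run_modules(row: dict, modules: list) -> dict:
    """Apply each registered module in order, as in a modular pipeline."""
    for module in modules:
        row = module.process(row)
    return row

enriched = run_modules(
    {"url": "https://info.cern.ch/", "outgoing_links": ["a", "b", "c"]},
    [LinkCounterModule()],
)
```

Because modules only read from and write to the shared row, they can be added or removed independently, which is what makes use-case-specific adaptation possible.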

Preprocessing Plugins Evaluation

The plugins added to Resilipipe will need to meet criteria with respect to throughput and output quality. To standardize and facilitate the evaluation of plugins, the TIRA platform will be hosted at one of the data centers. The TIRA Integrated Research Architecture focuses on hosting shared tasks and facilitates the submission of software for evaluation. The platform encapsulates submitted software in virtual machines and disconnects them from the internet (sandboxing) before allowing them to process data hosted in TIRA. This approach has the advantage of (i) making it easy to re-evaluate submitted software and (ii) ensuring that software does not leak data to the outside. On the platform, three types of benchmarking processes are required to assess new plugins:

  1. Efficiency benchmarks to evaluate scalability

  2. Test benchmarks on broken pages to ensure these are handled properly

  3. Quality benchmarks to evaluate the performance on the desired task

While the first and second processes are largely independent of the specific content analysis that the plugin performs, the third is application-specific and thus requires dedicated data. Whenever a new type of plugin is to be added to the content analysis library, the corresponding dataset needs to be provided by the party that developed the plugin. To ensure the validity of the quality benchmark, the chosen dataset should ideally already be in use as a standard benchmark, for instance in scientific publications in the relevant field. If a new plugin is developed for an existing task, the existing dataset for that task
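The three benchmark types can be sketched as a single harness that measures throughput, counts failures on broken pages, and scores quality against gold labels. This is an illustrative stand-in for the kind of measurements involved, not TIRA's actual evaluation protocol.

```python
import time

def benchmark_plugin(plugin, pages, labels=None):
    """Run efficiency, robustness, and (optionally) quality checks
    against a plugin: a callable mapping a page to a predicted label.
    Illustrative only; TIRA's real harness works differently."""
    predictions, errors = [], 0
    start = time.perf_counter()
    for page in pages:
        try:
            predictions.append(plugin(page))
        except Exception:  # broken pages must not crash the run
            errors += 1
            predictions.append(None)
    elapsed = time.perf_counter() - start
    result = {
        "pages_per_second": len(pages) / elapsed if elapsed else float("inf"),
        "num_errors": errors,
    }
    if labels is not None:  # quality needs task-specific gold data
        correct = sum(p == l for p, l in zip(predictions, labels))
        result["accuracy"] = correct / len(labels)
    return result

# Toy classifier plugin: label a page "en" if it contains "the".
plugin = lambda page: "en" if "the" in page else "other"
report = benchmark_plugin(plugin, ["the web", "das Netz"], ["en", "other"])
```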