Creating the OWI
The trickiest part is developing a scalable pipeline that creates a full open web index. The following animation shows how our pipeline works in general and how it is used. Matching colors indicate the same web page (potentially crawled at different points in time).
Crawling, preprocessing, and indexing are conducted in batches, and each full pipeline should run at a single data center to ensure data locality. Different tiers separate the processes and software stacks involved, and our central frontier coordinates the crawling process and thus the data partitioning.
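To illustrate how a central frontier can tie crawling to data partitioning, here is a minimal sketch. It assumes hosts are assigned to data centers by a stable hash, so that all pages of one host land in the same place. This scheme, the function `assign_data_center`, and the data center names are hypothetical and are not taken from the actual frontier implementation.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical data center identifiers, used only for illustration.
DATA_CENTERS = ["dc-a", "dc-b", "dc-c"]


def assign_data_center(url: str, centers=DATA_CENTERS) -> str:
    """Map a URL to a data center via a stable hash of its host name.

    Assumption: partitioning by host keeps each site's crawl data
    in one place. The real frontier may partition differently.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a center index.
    return centers[int.from_bytes(digest[:8], "big") % len(centers)]


first = assign_data_center("https://example.org/a")
second = assign_data_center("https://example.org/b?page=2")
```

Because the hash depends only on the host, both example URLs resolve to the same data center, which is the property that data locality relies on in this sketch.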
Crawl data is written to the data infrastructure in a specific folder structure, which is shared across all data centers.
The crawl data can be accessed using iRODS, which federates requests between data centers.
Via our toolkit, a client application pulls the data for today or for previous days and imports it into the local client (e.g. for Lucene or Solr). For daily updates, deduplication happens on the client side during import.
We plan to provide already deduplicated core indices in the future, which can serve as a starting point for daily updates.
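The client-side deduplication during import can be sketched as follows. This is a minimal illustration, not the toolkit's actual logic: it assumes that two records are duplicates when they share a URL, and that the most recent crawl wins. The `CrawlRecord` shape and the `import_batches` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical record shape; the real OWI fields may differ."""
    url: str
    crawl_date: str  # ISO date such as "2024-05-01", so strings sort by time
    content: str


def import_batches(
    index: Dict[str, CrawlRecord],
    batches: Iterable[List[CrawlRecord]],
) -> Dict[str, CrawlRecord]:
    """Import daily batches, keeping only the newest record per URL."""
    for batch in batches:
        for record in batch:
            existing = index.get(record.url)
            # Keep the record if the URL is new or this crawl is more recent.
            if existing is None or record.crawl_date > existing.crawl_date:
                index[record.url] = record
    return index


day1 = [
    CrawlRecord("https://example.org/a", "2024-05-01", "first version"),
    CrawlRecord("https://example.org/b", "2024-05-01", "other page"),
]
day2 = [CrawlRecord("https://example.org/a", "2024-05-02", "second version")]

index = import_batches({}, [day1, day2])
```

After both days are imported, the index holds one entry per URL, and the entry for `https://example.org/a` is the one crawled on the second day. A deduplicated core index, once available, could be passed in as the initial `index` instead of an empty dictionary.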
Open Source Software Repositories
Each of the parts we developed to build and sustain the OWI and OWSE-HUB is published as an open-source software repository. Archived versions of our artefacts can be found in the owseu Zenodo community:
Open Web Search Crawler (OWler) CCT – Deployed on Apache Storm – Continuous Service
Resiliparse: WARC/HTML parsing and preprocessing library PET – Library used in Spark Jobs
Resilipipe: Cluster-based web content analysis with Resiliparse. PET – Spark Jobs running Preprocessing and Enrichment
TIRA: platform for replicability and comparison of information retrieval experiments PPET – Platform running as continuous services (at CSC) for evaluating retrieval pipelines (including our plugins)
Open Web Search Indexer IST – Spark Job running the OWS Indexer using the CIFF Toolkit
CIFF toolkit: library for processing WARC files IST – Library used in indexing and for end users working with CIFF Files
CIFF to Lucene index converter IST – Library used in indexing and for end users working with CIFF Files