Creating the OWI#

The trickiest part is developing a scalable pipeline to create a full open web index. The following animation shows how our pipeline works and the basic usage principle. Color indicates identical web pages (potentially crawled at different points in time).

  • Crawling, preprocessing, and indexing are conducted in batches, with each full pipeline running at a single data center to ensure data locality. Different tiers separate the processes and software stacks involved, and our central frontier coordinates the crawling process and thus the data partitioning.

  • Crawl data is written to the data infrastructure in a specific folder structure, which is shared across all data centers.

  • The crawl data can be accessed using iRODS, which federates requests between data centers; a minimal access sketch follows this list.

  • Via our toolkit, a client application pulls the data from previous days or from the current day and imports it into a local index (e.g. Lucene or Solr). For daily updates, deduplication happens on the client side during import; an import sketch follows this list.

    • We plan to provide pre-deduplicated core indices in the future, which can serve as a starting point for daily updates.
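
The following is a minimal sketch of how a client might pull one day's crawl partition through iRODS, using the python-irodsclient package. The host, zone, credentials, and folder layout are illustrative assumptions, not the actual OWI configuration.

```python
# Minimal sketch: fetch one day's crawl partition from the shared iRODS
# folder structure. All paths and connection details are hypothetical.
import os

from irods.session import iRODSSession

os.makedirs("./local_data", exist_ok=True)

with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret", zone="exampleZone") as session:
    # Hypothetical daily partition within the shared folder structure
    collection = session.collections.get("/exampleZone/owi/crawl/2024-01-15")
    for obj in collection.data_objects:
        # Download each data object (e.g. a WARC or parquet file) locally
        session.data_objects.get(obj.path, f"./local_data/{obj.name}")
```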

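As a rough illustration of the client-side import with deduplication, the sketch below assumes the pulled data has already been converted to JSON Lines records with "url" and "content" fields and that a local Solr core exists; the core name, field names, and file layout are assumptions made for this example.

```python
# Minimal sketch: import pulled crawl records into a local Solr core,
# skipping documents whose URL was already seen in this update run.
import glob
import hashlib
import json

import pysolr  # pip install pysolr

solr = pysolr.Solr("http://localhost:8983/solr/owi_core", always_commit=True)

seen = set()   # URL hashes already imported during this daily update
batch = []

for path in glob.glob("./local_data/*.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            key = hashlib.sha1(record["url"].encode("utf-8")).hexdigest()
            if key in seen:
                continue  # drop duplicates before they reach the index
            seen.add(key)
            batch.append({"id": key,
                          "url": record["url"],
                          "content": record.get("content", "")})

if batch:
    solr.add(batch)  # push the deduplicated documents to the local index
```
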
Open Source Software Repositories#

Each of the components we developed to build and sustain the OWI and OWSE-HUB is published as an open-source software repository. Archived versions of our artefacts can be found in the owseu Zenodo community: