# Creating the OWI

The trickiest part is developing a scalable pipeline that creates a full open web index. The following animation shows how our pipeline works and how it is meant to be used. Identical colors mark the same web page (potentially crawled at different points in time).

<video width="100%" autoplay controls loop>
  <source src="../../_static/videos/pipeline_animation.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

- [Crawling](https://opencode.it4i.eu/openwebsearcheu-public/owler), [preprocessing](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline) and [indexing](https://opencode.it4i.eu/openwebsearcheu-public/spark-indexer) are conducted in batches, and each full pipeline should run within a single data center to ensure data locality. Different [tiers](infrastructure.md) separate the processes and software stacks involved. Our central [frontier](https://opencode.it4i.eu/openwebsearcheu-public/url-frontier) coordinates the crawling process and thus the data partitioning.
- Crawl data is written to the [data infrastructure](infrastructure.md) in a specific folder structure, which is shared across all data centers.
- The crawl data can be accessed using [iRODS](https://irods.org/), which federates requests between data centers.
- Via our [toolkit](https://opencode.it4i.eu/openwebsearcheu-public/ciff-toolkit), a client application pulls the data for the current day or for previous days and imports it into the local client (e.g. for [Lucene](https://opencode.it4i.eu/openwebsearcheu-public/prototype-search-application) or [Solr](https://opencode.it4i.eu/openwebsearcheu-public/solr-ciff)). For daily updates, deduplication happens on the client side during import.
  - We plan to provide already deduplicated core indices in the future, which can serve as a starting point for daily updates.
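
The client-side deduplication during import can be pictured with a minimal Python sketch. It is an illustration only and not the toolkit's actual implementation: the `Record` class, its field names, and the choice of the URL as the deduplication key are all assumptions made for this example. The idea is that several daily batches may contain the same page, and the importer keeps only the most recently crawled version.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    """Hypothetical crawl record; the fields are illustrative, not the OWI schema."""
    url: str
    crawl_date: str  # ISO date such as "2024-05-01", so string order equals time order
    content: str


def import_batches(batches: Iterable[List[Record]]) -> Dict[str, Record]:
    """Merge daily batches into a local store, keeping the newest record per URL."""
    store: Dict[str, Record] = {}
    for batch in batches:
        for record in batch:
            existing = store.get(record.url)
            # Keep the record if the page is new or this crawl is more recent.
            if existing is None or record.crawl_date > existing.crawl_date:
                store[record.url] = record
    return store


# Two made-up daily batches; page "a" was crawled on both days.
day1 = [
    Record("https://example.org/a", "2024-05-01", "old a"),
    Record("https://example.org/b", "2024-05-01", "b"),
]
day2 = [Record("https://example.org/a", "2024-05-02", "new a")]

index = import_batches([day1, day2])
print(len(index))
print(index["https://example.org/a"].content)
```

Because the comparison uses the crawl date rather than the arrival order, the result does not depend on the order in which the batches are imported.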
  
% ## Current Deployment
%
% ```{figure} ./figures/current-pipeline-deployment.png
%
% Current deployment structure of the pipeline. Green shows planned workflows.
% ``` 

## Open Source Software Repositories

Each component we develop to build and sustain the OWI and OWSE-HUB is published as an open-source software repository. Archived versions of our artefacts can be found in [the owseu Zenodo community](https://zenodo.org/communities/owseu/):

- [Open Web Search Crawler (OWler)](https://opencode.it4i.eu/openwebsearcheu-public/owler)
  CCT – Deployed on Apache Storm – Continuous Service
- [Resiliparse: WARC/HTML parsing and preprocessing library](https://github.com/chatnoir-eu/chatnoir-resiliparse)
  PET – Library used in Spark Jobs
- [Resilipipe: Cluster-based web content analysis with Resiliparse.](https://opencode.it4i.eu/openwebsearcheu-public/preprocessing-pipeline)
  PET – Spark Jobs running Preprocessing and Enrichment
- [TIRA: platform for replicability and comparison of information retrieval experiments](https://github.com/tira-io/tira)
  PPET – Platform running as continuous services (at CSC) for evaluating retrieval pipelines (including our plugins)
- [Open Web Search Indexer](https://opencode.it4i.eu/openwebsearcheu-public/spark-indexer)
  IST – Spark Job running the OWS Indexer using the CIFF Toolkit
- [CIFF toolkit: library for processing WARC files](https://opencode.it4i.eu/openwebsearcheu-public/ciff-toolkit)
  IST – Library used in indexing and for end users working with CIFF Files
- [CIFF to Lucene index converter](https://opencode.it4i.eu/openwebsearcheu-public/lucene-ciff)
  IST – Library used in indexing and for end users working with CIFF Files
