Realisation of the OWI

To this end, we envision the OWI to become a (distributed) information system similar to the well-known Docker hub. Instead of virtual machines, though, the OWI would contain pre-built indexes that would be readily usable. Our general vision for this distributed system is shown in the figure below.

[Figure: ../../_images/owi-vision.png]

Fig. 4 General architecture of the OWI, and the way in which a search engine could interact with the OWI to retrieve (parts of) the index.

Our federated data infrastructure (spread across multiple European data centers) crawls, enriches, and indexes web content in a distributed manner. These indexes are fragmented into a set of pre-defined (possibly overlapping) verticals and continuously updated over time. Aside from these vertical indexes, we also create a “core” index that contains the most popular and important websites on the Web. Since a small subset of the Web accounts for a large share of user clicks, a core index covering only this popular subset should suffice for a large number of basic queries. Taken together, these index “shards” form the OWI and enable a number of downstream uses (also shown in the overview above):

  1. Users or organisations can download (or “pull”) a specific, pre-built index.

  2. They can choose a specific timestamp or checkpoint of the index (e.g. “latest” for the most recent version).

  3. They can choose to download a selection of checkpoints, instead of only a single one (e.g. “all” for the complete history of a specific index).

  4. Users or organisations can create (or “build”) their own index locally, using a dataset of their choosing (e.g. privacy-sensitive data, such as a corporate filesystem or personal email).

  5. Users or organisations can upload (or “push”) a custom index or custom metadata to contribute to the OWI.

The core data structure behind the OWI is the inverted file: a mapping from each term on the Web to the pages or documents in which it appears. To ensure usability of the OWI, we have chosen to store the inverted files in a common, easily transferable format: the Common Index File Format (CIFF) [LMK+20].
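The pull/build/push workflow above can be sketched as a minimal client. Everything here is illustrative: the `OWIClient` class, its method names, and the in-memory registry are assumptions for the sketch, not an actual OWI API.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the OWI does not define this client API.
# Class, method, and checkpoint names below are hypothetical.

@dataclass
class OWIClient:
    registry: dict = field(default_factory=dict)  # in-memory stand-in for the OWI registry

    def push(self, name: str, checkpoint: str, index: dict) -> None:
        """Upload a (custom) index under a named checkpoint."""
        self.registry.setdefault(name, {})[checkpoint] = index

    def pull(self, name: str, checkpoint: str = "latest") -> dict:
        """Download a pre-built index at a specific checkpoint."""
        shards = self.registry[name]
        if checkpoint == "latest":
            checkpoint = max(shards)   # most recent checkpoint label
        if checkpoint == "all":
            return dict(shards)        # complete history of this index
        return shards[checkpoint]

client = OWIClient()
client.push("news-vertical", "2024-01", {"term": ["doc1"]})
client.push("news-vertical", "2024-02", {"term": ["doc1", "doc2"]})

latest = client.pull("news-vertical")          # resolves "latest" to "2024-02"
history = client.pull("news-vertical", "all")  # every checkpoint of this index
```

The checkpoint labels ("latest", "all") mirror the semantics described in steps 2 and 3 above.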

CIFF is a Protobuf schema that describes the inverted files in a structured, consistent and minimal format. A CIFF file consists of the following data:

– A header, containing basic statistics about the collection (e.g. the number of documents, the number of unique terms, the average document length, etc.).

– For each term in the corpus, a record with the document frequency (how many documents contain the term), the collection frequency (how often the term occurs in total throughout the collection), and a posting list. The posting list has an entry for each document in which the term appears, containing an internal document identifier and the term frequency (how often the term occurs in that document).

– For each document in the corpus, a record with the internal, numeric document identifier, the external document identifier (allowing the document to be mapped back to the original corpus), and the document length.
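The three record types above can be sketched in plain Python. Note that the field names below paraphrase the prose; the actual CIFF Protobuf schema may name its fields differently.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Plain-Python sketch of the CIFF record types described above.
# Field names paraphrase the description, not the real Protobuf schema.

@dataclass
class Header:
    num_docs: int          # number of documents in the collection
    num_terms: int         # number of unique terms
    avg_doclength: float   # average document length

@dataclass
class PostingsList:
    term: str
    df: int                          # document frequency
    cf: int                          # collection frequency
    postings: List[Tuple[int, int]]  # (internal doc id, term frequency)

@dataclass
class DocRecord:
    docid: int             # internal, numeric document identifier
    collection_docid: str  # external identifier, maps back to the corpus
    doclength: int

# A two-document toy collection with the single term "owi":
header = Header(num_docs=2, num_terms=1, avg_doclength=5.0)
owi = PostingsList(term="owi", df=2, cf=3, postings=[(0, 1), (1, 2)])
docs = [DocRecord(0, "url:a", 4), DocRecord(1, "url:b", 6)]

# Consistency implied by the definitions above:
assert owi.df == len(owi.postings)                # df = number of postings
assert owi.cf == sum(tf for _, tf in owi.postings)  # cf = sum of term frequencies
```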

The CIFF standard contains the basic information needed to build a successful search engine from an index, which makes it easy for an existing search engine to import the data and transform it into its internal data structures — easier than transforming it from one search engine’s internal format into another’s. In fact, the CIFF standard was proposed by the developers of a number of existing open source search engines (like Lucene, (Py)Terrier, and PISA), and these search engines already support reading and/or writing CIFF. As a result, the indexes we build in the project can readily be used by external parties with minimal extra effort.
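To illustrate that these fields suffice for a working ranker, here is a sketch of BM25 scoring computed purely from the statistics CIFF carries: document count and average length from the header, document frequency and term frequency from the posting lists, and document length from the document records. The formula is standard BM25, not something defined by CIFF itself.

```python
import math

# BM25 from exactly the statistics CIFF provides:
# N and avgdl come from the header, df/tf from a posting list entry,
# doclength from the document record. k1 and b are the usual parameters.

def bm25(tf: int, df: int, doclength: int, N: int, avgdl: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doclength / avgdl))
    return idf * norm

# Toy collection: 1000 docs, average length 100 terms.
# A term with df = 10, occurring 3 times in a 50-term document:
score = bm25(tf=3, df=10, doclength=50, N=1000, avgdl=100.0)
```

Because the score depends only on CIFF-level statistics, any engine that imports a CIFF file can rank documents without consulting the original corpus.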

As part of our work on and with the OWI, we also investigate limitations of the CIFF standard, and propose extensions of the format where necessary.