Indexing


The IST takes the cleaned content from the PET and turns it into three data products:

  1. Inverted file(s) of the document collection

  2. Web graph (including anchor texts)

  3. Metadata store

Indexing is supported by the Index and Storage Tier (IST), which turns the cleaned content from the PET into a usable index (inverted file). Similar to the CCT and the PET, each data center has its own IST (and only one), responsible for building the index for the documents crawled by that data center. The indexes are inverted files partitioned into so-called “shards” by selected types of metadata derived by the PET (e.g., topic and language), and they are distributed as CIFF files [LMK+20]. Like the PET, the IST is implemented as a Spark batch job. In its current state, it reads the Parquet files delivered by the PET and writes out inverted files per shard (split by arbitrary metadata values). This allows us to build semantically coherent shards of the full web index (depending on the metadata extracted by the PET), which can be used to enable a large variety of downstream search engines. For instance, language can be used to build search engines for specific countries, geo-information can be used to focus on specific areas, and classified topics or genres can be used to build search engines for a specific vertical (such as news or sports).

Currently (end of July 2023), the IST has been successfully deployed to both BADW-LRZ and IT4I@VSB. Current efforts focus on scaling up the indexer and building indexes for the content crawled thus far.
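As a rough illustration of this setup, the sketch below reads the Parquet output of the PET and writes it back out split by shard metadata. The paths and column names (id, plain_text, language, topic) are assumptions for illustration only, not the actual PET schema, and the real indexer produces CIFF inverted files rather than partitioned Parquet.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ist-sharding-sketch").getOrCreate()

# Cleaned documents as delivered by the PET (illustrative columns:
# id, plain_text, outlinks, language, topic).
docs = spark.read.parquet("/data/pet-output/")

# Group the collection by the metadata selected for sharding, so that each
# (language, topic) combination becomes its own semantically coherent shard.
# The actual indexer would turn each partition into a CIFF inverted file.
(docs
    .repartition("language", "topic")
    .write
    .partitionBy("language", "topic")
    .mode("overwrite")
    .parquet("/data/ist-shards/"))

spark.stop()
```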

We make a distinction between a) the indexer, which produces the inverted files and the web graph, and b) the metadata store.

Indexer

The indexer runs in Spark and gets its input from the first (quick and simple) preprocessing stage. It works in batch form on entire crawls or, at the very least, on large collections of documents. It needs the following input data:

  • Document identifier; a string or numerical identifier that can be used to reference the document in different places: the inverted file, the metadata store, the web graph, etc. This can be a normalised version of the URL (e.g. something like SURT) or a consistent, reproducible UUID; these are probably the same IDs as described in “Unique Identifiers for URLs”. Ideally, the identifiers are sort-friendly, meaning that similar URLs (same website, domain, etc.) are grouped together when the documents are sorted. This would allow internal identifiers in the inverted file to be handed out more efficiently.

  • Cleaned text content; i.e. the text content that remains after HTML parsing, boilerplate removal and other cleaning of the data. Initially, all text can be given as a single document. Later on, we might need to make a distinction between, for example, the title and the body of a document.

  • Outgoing links; to build the web graph, we need the list of outgoing hyperlinks for each document. Ideally, these are also normalised in the same way as the document identifiers, to easily match links with their target pages.

  • Metadata used for splitting the index; as discussed throughout the project, efficient use of the Open Web Index means that the index can be split into subparts for more specific use cases. To partition the index efficiently during the indexing phase, this metadata needs to be specified before indexing. Think of the following metadata (feel free to add or discuss):

    • Language

    • TLD

    • Topic (propagated from the crawler at first)

    • Timestamp

Inverted files

The indexer produces a single inverted file for each partition of the dataset, in the CIFF format. The inverted files use a different, internal document identifier scheme, since d-gap compression is used to limit the size of the index.
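To illustrate why the internal identifiers differ from the external document identifiers, here is a minimal sketch of d-gap encoding of a posting list. The function names are ours, for illustration only; they show the general technique, not the indexer's exact encoding.

```python
def dgap_encode(docids):
    """Store each posting as the difference (gap) to the previous docid.

    Gaps are small integers, so they compress much better than the raw
    internal docids.
    """
    gaps = []
    previous = 0
    for docid in sorted(docids):
        gaps.append(docid - previous)
        previous = docid
    return gaps


def dgap_decode(gaps):
    """Recover the original docids by accumulating the gaps."""
    docids = []
    total = 0
    for gap in gaps:
        total += gap
        docids.append(total)
    return docids


# Posting list [3, 17, 18, 150] is stored as gaps [3, 14, 1, 132].
assert dgap_decode(dgap_encode([3, 17, 18, 150])) == [3, 17, 18, 150]
```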

Web graph

The web graph is defined as a list of edges, where each edge consists of:

  • Identifier of the ‘from’ document

  • Identifier of the ‘to’ document

  • Anchor text

If downloaded, the web graph is distributed as a Parquet file.
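A minimal sketch of what such an edge record could look like when the web graph is exported to Parquet; the column names are illustrative and not the definitive schema.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative edge schema: one row per hyperlink in the crawled documents.
edge_schema = pa.schema([
    ("from_docid", pa.string()),   # identifier of the linking document
    ("to_docid", pa.string()),     # identifier of the linked document
    ("anchor_text", pa.string()),  # text of the link, if any
])

edges = pa.table(
    {
        "from_docid": ["example.org/news", "example.org/news"],
        "to_docid": ["example.org/sports", "other.example/about"],
        "anchor_text": ["Sports section", "About us"],
    },
    schema=edge_schema,
)

pq.write_table(edges, "webgraph.parquet")
```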

CIFF toolkit

Accompanying the inverted files in CIFF format, we are building a CIFF toolkit to manipulate the CIFF files produced by the indexer. All operations in the CIFF toolkit modify data on the fly, reading data lazily and writing to disk whenever possible, so the entire CIFF file never needs to be loaded into memory. To enable streaming operations on CIFF files, the document records have to be processed before the posting lists. For instance, when merging CIFF files, we need to re-map the internal docids and use the new docids in the merged CIFF file; this is only possible if we first process the document records and only then the posting lists.

Current features of the CIFF toolkit:

  • Swap between header - doc record - posting list and header - posting list - doc record orders in CIFF files

  • Merge two or more CIFF files (the docid re-mapping this requires is sketched below)
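To illustrate the docid re-mapping that merging requires, here is a small, self-contained sketch using plain Python structures instead of the actual CIFF protobuf messages; the data layout is simplified and the function is not the toolkit's real API. It assumes the input collections contain disjoint documents.

```python
def merge_doc_records(collections):
    """Assign fresh internal docids to the documents of several collections.

    Each collection is a list of (old_docid, external_id) document records.
    Returns the merged record list plus one {old_docid: new_docid} mapping
    per input collection; these mappings are needed afterwards to rewrite
    the posting lists of each collection, which is why the doc records
    must be processed first.
    """
    merged_records = []
    remappings = []
    next_docid = 0
    for records in collections:
        mapping = {}
        for old_docid, external_id in records:
            mapping[old_docid] = next_docid
            merged_records.append((next_docid, external_id))
            next_docid += 1
        remappings.append(mapping)
    return merged_records, remappings


# Two toy collections: internal docids restart at 0 in each one.
a = [(0, "https://example.org/"), (1, "https://example.org/news")]
b = [(0, "https://other.example/")]
records, (map_a, map_b) = merge_doc_records([a, b])
# map_b[0] == 2: postings of collection b must be rewritten to docid 2
# before they can be appended to the merged index.
```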

Planned features:

  • Filter CIFF files on document identifiers

  • In-memory processing of CIFF operations if they fit in memory (e.g. for small CIFF files or HPC nodes with large memory availability). This would also allow multiprocessing to be used for merging posting lists.

Metadata store

The metadata store hosts a relational database containing tuples <docid, meta_key, meta_value>. We have opted for a relational database because of its lower insertion cost. Once (part of) the metadata store is requested for download, it is exported as Parquet. As the underlying database engine, we will likely opt for DuckDB. The inputs for the metadata store come from the second (expensive) stage of the preprocessing & enrichment pipeline. As a frontend, we host a REST API through which metadata can be inserted and parts of the metadata store can be queried.
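A minimal sketch of how such a <docid, meta_key, meta_value> store could look in DuckDB, including the Parquet export; the table and column names are illustrative, not the definitive schema.

```python
import duckdb

con = duckdb.connect("metadata.duckdb")

# One row per (document, key) pair; values kept as text for simplicity.
con.execute("""
    CREATE TABLE IF NOT EXISTS metadata (
        docid      VARCHAR,
        meta_key   VARCHAR,
        meta_value VARCHAR
    )
""")

# Insertions as they would arrive from the second (expensive) PET stage.
con.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("doc-0001", "language", "nl"),
        ("doc-0001", "topic", "sports"),
        ("doc-0002", "language", "de"),
    ],
)

# Query a slice of the store, e.g. all Dutch documents.
dutch_docs = con.execute(
    "SELECT docid FROM metadata WHERE meta_key = 'language' AND meta_value = 'nl'"
).fetchall()

# Export (part of) the store as Parquet when it is requested for download.
con.execute("COPY (SELECT * FROM metadata) TO 'metadata.parquet' (FORMAT PARQUET)")
```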