Collection Indices

Collection indices are materialised subsets of the main index that are considered important or relevant enough to be pre-computed and stored ahead of time.

A collection index may contain only the individual preprocessed web pages plus metadata, stored in Parquet files, or it may additionally include pre-computed CIFF index files.

Collection Index: embeddings

We will release an additional collection index containing embeddings for (part of) our crawled data. The parquet files in the embeddings index will contain the following information:

  • start_end_position: A list of tuples containing the start and end character positions of every chunk.

  • embeddings: A list of embeddings created for the document.
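
The two fields above can be combined to recover each chunk's text together with its vector. The following is only a minimal sketch under several assumptions that the description does not confirm: that the embeddings list holds exactly one vector per entry in start_end_position, in the same order; that the end position is exclusive (Python slice semantics); and that the document text is available from elsewhere. The helper name iter_chunks and the sample record are hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def iter_chunks(
    text: str,
    start_end_position: Sequence[Tuple[int, int]],
    embeddings: Sequence[Sequence[float]],
) -> Iterator[Tuple[str, Sequence[float]]]:
    """Yield (chunk_text, embedding) pairs for one document.

    Assumes one embedding per (start, end) tuple, in the same order,
    and that `end` is exclusive. Both are assumptions, not documented facts.
    """
    if len(start_end_position) != len(embeddings):
        raise ValueError("expected one embedding per chunk position")
    for (start, end), vector in zip(start_end_position, embeddings):
        yield text[start:end], vector


# Hypothetical record shaped like one row of the embeddings Parquet files.
record = {
    "start_end_position": [(0, 11), (12, 23)],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}
text = "Hello world Second part"  # document text, obtained separately

chunks = list(
    iter_chunks(text, record["start_end_position"], record["embeddings"])
)
for chunk_text, vector in chunks:
    print(repr(chunk_text), vector)
```

If the released files turn out to use inclusive end positions, the slice would need to become text[start:end + 1].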