Collection Indices
Collection indices are materialised subsets of the main index that are considered important or relevant enough to be pre-computed and stored.
A collection index may contain only the individual preprocessed web pages and their metadata, stored in parquet files, or it may additionally include pre-computed CIFF index files.
Collection Index: legal
One collection index that is available is called legal, as it contains only
Collection Index: embeddings
We will release an additional collection index containing embeddings for (part of) our crawled data. The parquet files
in the embeddings index will contain the following information:
start_end_position: A list of tuples containing the start and end character positions of every chunk.
embeddings: A list of embeddings created for the document.
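The following minimal Python sketch illustrates how the two fields could be used together once a row has been loaded. It rests on several assumptions that the description above does not confirm: that there is exactly one embedding per chunk, in the same order as start_end_position; that end positions are exclusive (as in Python slices); and that the document text is available alongside the row. The "text" key, the helper name pair_chunks_with_embeddings, and the sample record are all hypothetical and made up for illustration.

```python
from typing import List, Sequence, Tuple


def pair_chunks_with_embeddings(
    text: str,
    start_end_position: Sequence[Tuple[int, int]],
    embeddings: Sequence[Sequence[float]],
) -> List[Tuple[str, Sequence[float]]]:
    """Return (chunk_text, embedding) pairs for one document."""
    # Assumption: one embedding per chunk, stored in the same order.
    if len(start_end_position) != len(embeddings):
        raise ValueError("expected one embedding per chunk")
    # Assumption: end offsets are exclusive, like Python slice bounds.
    return [
        (text[start:end], vector)
        for (start, end), vector in zip(start_end_position, embeddings)
    ]


# Made-up record for illustration; a real row would come from a parquet file,
# and the "text" field is a hypothetical addition not listed above.
record = {
    "text": "First chunk. Second chunk.",
    "start_end_position": [(0, 12), (13, 26)],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

pairs = pair_chunks_with_embeddings(
    record["text"], record["start_end_position"], record["embeddings"]
)
for chunk_text, vector in pairs:
    print(repr(chunk_text), vector)
```

If the released files turn out to use inclusive end positions, the slice would need to be text[start:end + 1] instead.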