Collection Indices
Collection indices are materialised subsets of the main index that are important or relevant enough to be pre-computed and pre-stored.
A collection index may contain only the individual preprocessed web pages plus metadata, stored in Parquet files, or it may additionally include pre-computed CIFF index files.
Collection Index: legal
One collection index that is available is called legal, as it contains only
Collection Index: embeddings
We will release an additional collection index containing embeddings for (part of) our crawled data. The Parquet files in the embeddings index will contain the following information:
start_end_position: A list of tuples containing the start and end character positions of every chunk.
embeddings: A list of embeddings created for the document.
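
As a rough illustration of how these two fields might be used together, the sketch below pairs each chunk of a document's text with an embedding. It rests on several assumptions that the description above does not confirm: that the embeddings list holds one vector per chunk in the same order as start_end_position, that the end position is exclusive (as in Python slicing), and that the document text is available from elsewhere. The function name pair_chunks_with_embeddings and the sample row are hypothetical.

```python
from typing import List, Sequence, Tuple


def pair_chunks_with_embeddings(
    text: str,
    start_end_position: Sequence[Tuple[int, int]],
    embeddings: Sequence[Sequence[float]],
) -> List[Tuple[str, Sequence[float]]]:
    """Pair each chunk of `text` with the embedding at the same list index."""
    # Assumption: one embedding per chunk, stored in the same order.
    if len(start_end_position) != len(embeddings):
        raise ValueError("expected one embedding per chunk")
    # Assumption: `end` is exclusive, as in Python slicing.
    return [
        (text[start:end], vector)
        for (start, end), vector in zip(start_end_position, embeddings)
    ]


# Made-up row for illustration; real rows would come from the Parquet files.
text = "First chunk. Second chunk."
row = {
    "start_end_position": [(0, 12), (13, 26)],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

chunks = pair_chunks_with_embeddings(
    text, row["start_end_position"], row["embeddings"]
)
for chunk, vector in chunks:
    print(repr(chunk), vector)
```

If the released files turn out to use inclusive end positions or a different alignment between chunks and vectors, the slicing and the length check would need to be adjusted accordingly.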