Tutorial 0: How to create an index locally#

This tutorial shows you how to set up our indexing pipeline locally, based on local crawls and using Docker.

Prerequisites#

You should have Docker installed on your system such that the docker command is available from your command line. You should also have wget or a similar tool for scraping websites at your disposal.
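Before you start, you can quickly verify that both tools are available (the exact version output will differ on your system):

docker --version
wget --version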

:::{note}
Please note that you should use local crawling, as done in this tutorial, only for testing purposes or in special cases. Crawling puts a potentially heavy load on server systems and can be considered impolite if it does not follow crawl etiquette.
:::

Step 1: Scraping a list of Webpages#

As a first step, you should perform a simple crawl using wget or a similar tool:

# Create a directory that will be mounted in the Docker containers
mkdir -p data

wget --input-file urls.txt \
    --recursive \
    --level 2 \
    --delete-after \
    --no-directories \
    --warc-file data/crawl

You can leave out the --recursive and --level arguments if you only want to fetch the list of URLs and don't want to perform recursive or explorative crawling. Note that wget includes angle brackets < and > around the URL. The preprocessing pipeline handles this correctly, but the final metadata files will still include these brackets in the column for the complete URL.
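The --input-file argument expects a plain text file with one seed URL per line. As a minimal illustration (the URLs below are placeholders; replace them with the pages you actually want to crawl), urls.txt could look like this:

cat > urls.txt <<'EOF'
https://example.org/
https://example.com/
EOF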

Note

We assume all directories to be relative to the parent directory. If you have a different setup, you need to adjust the paths accordingly.

Hint

For large-scale crawling, you can also set up our Open Web Index Crawler, OWLER for short, or explore one of our other options below.

Step 2: Preprocessing#

Given the results of your crawl, you can run the preprocessing pipeline:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/preprocessing-pipeline \
    /data/crawl.warc.gz \
    /data/metadata.parquet.gz
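Once the container finishes, the preprocessed metadata should be available on the host. A quick sanity check:

ls -lh data/metadata.parquet.gz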
    

Hint

For more details and the source code, take a look at our preprocessing pipeline.

Step 3: Indexing#

And finally, you can index the preprocessed metadata and get the index for your crawl:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    /data/metadata.parquet.gz \
    /data/index/

You should now have the following files:

data/crawl.warc.gz
data/metadata.parquet.gz
data/index/index.ciff.gz
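You can confirm this from the shell:

ls -lh data/crawl.warc.gz data/metadata.parquet.gz data/index/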

Hint

For more details on indexing, please take a look at our spark indexer, especially if you experience out-of-heap-memory errors.

Step 4: Consuming the index#

You can consume the index using MOSAIC or other CIFF-compatible search engine libraries / frameworks like PyTerrier PISA.

In short, you have to import the CIFF file into a Lucene index and then run the application using the imported index. Metadata is consumed from the Parquet file, and the name of the Lucene index and the name of the directory that contains the Parquet file must be the same.

First, run the container for the importer:

  1. Import the CIFF file to a Lucene index

# Convert the CIFF file into a Lucene index named demo-index
mkdir -p data/serve/lucene
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/lucene-ciff \
    /data/index/index.ciff.gz \
    /data/serve/lucene/demo-index
# Place the metadata in a directory with the same name as the Lucene index
mkdir -p data/serve/metadata/demo-index
cp data/metadata.parquet.gz data/serve/metadata/demo-index/metadata.parquet.gz

This will create the Lucene index /data/serve/lucene/demo-index. Then, serve the Lucene index:

  2. Run the search application

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -p 8008:8008 \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/search-service \
    --lucene-dir-path /data/serve/lucene/ \
    --parquet-dir-path /data/serve/metadata/

The application should be running on localhost:8008 now and you should be able to perform a search query (e.g., http://localhost:8008/search?q=europe). The parameters are:

  • --lucene-dir-path: path of the directory that contains the Lucene index(es)

  • --parquet-dir-path: path of the directory that contains the Parquet file(s)
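For example, once the service is up you can issue the query from another terminal (assuming curl is installed; the response format is defined by the search service):

curl "http://localhost:8008/search?q=europe"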

Hint

For more details on starting the MOSAIC search service, please take a look at available options.

Note

The name of the Lucene index (i.e., the directory name) and the name of the directory that contains the Parquet file must match the pattern {index-name}/metadata.parquet.gz (e.g., if the name of the Lucene index is wiki, the associated metadata directory name must also be wiki).
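For example, if the Lucene index is named wiki, the layout under data/serve/ would look like this:

data/serve/
├── lucene/
│   └── wiki/                      # Lucene index directory
└── metadata/
    └── wiki/
        └── metadata.parquet.gz    # matching metadata directory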

Options: Other data sources for indexing#

Wikipedia: Simple Wikipedia Abstracts#

Another interesting starting point is to index the Simple Wikipedia from its dumps.

Simple Wikipedia Abstract#

The Simple Wikipedia abstracts are quite small and quick to process.

So instead of crawling your own files in Step 1, use the wiki-to-ows tool to create the WARC file as follows:

docker run \
    --rm \
    -v "$PWD/data":/data \
    opencode.it4i.eu:5050/openwebsearcheu-public/wiki-to-ows \
    --download=https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-abstract.xml.gz \
    --warc \
    --compress \
    -o /data/simple_wiki_abstracts
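The exact name of the generated WARC file depends on the tool's output options; you can check what was written with:

ls -lh data/simple_wiki_abstracts*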

Note that indexing might require more heap memory (see below).

Options for Indexing#

Setting more Heap Memory and other Spark Properties#

For larger crawls, the indexing might run out of heap memory. You can either create smaller chunks or increase the heap size. You can override any Spark properties using a properties file mounted as spark-defaults.conf, as follows:

printf 'spark.driver.memory=8g\nspark.executor.memory=8g\n' > "$PWD/spark-properties.conf"
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -v "$PWD/spark-properties.conf":/opt/spark/conf/spark-defaults.conf \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    /data/metadata.parquet.gz \
    /data/index/

You can also override the entrypoint:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -w /opt/spark/work-dir \
    --entrypoint /bin/bash \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    -c "/opt/spark/bin/spark-submit --driver-memory 8g --executor-memory 8g Indexer-assembly-1.0.jar index --description 'CIFF description' --input-format parquet --output-format ciff --id-col record_id --content-col plain_text /data/metadata.parquet.gz /data/index"

Alternatively, you can also increase the number of partitions used when indexing via the --num-partitions option:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    --num-partitions 2000 \
    /data/metadata.parquet.gz \
    /data/index/