Tutorial: How to create an index locally#

This tutorial shows you how to set up our indexing pipeline locally, based on local crawls and using Docker.

Prerequisites#

You should have Docker installed on your system such that the docker command is available from your command line. Also, wget or a similar tool for scraping websites should be at your disposal.
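
To check that both tools are available from your shell, you can print their versions:

# Both commands should print a version string if the tools are installed
docker --version
wget --version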

Note

Please note that you should use local crawling, as done in this tutorial, only for testing purposes or in special cases. Crawling puts a potentially heavy load on server systems and can be considered impolite if it does not follow the crawl etiquette.

Step 1: Scraping a list of webpages#

As a first step, you should perform a simple crawl using wget or a similar tool:

# Create a directory that will be mounted in the Docker containers
mkdir -p data

wget --input-file urls.txt \
    --recursive \
    --level 2 \
    --delete-after \
    --no-directories \
    --warc-file data/crawl

You can leave out the --recursive and --level arguments if you only want to fetch the list of URLs, and don't want to perform recursive or explorative crawling. Note that wget includes angle brackets < and > around the URL. The preprocessing pipeline handles this correctly, but the final metadata files will still include these brackets in the column for the complete URL.
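
The file urls.txt contains the seed URLs for wget, one URL per line. A minimal example seed file (the URLs here are only placeholders, replace them with the pages you want to crawl):

# Create a small seed list; these URLs are placeholders
cat > urls.txt <<'EOF'
https://example.org/
https://example.org/about
EOF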

Note

We assume all paths to be relative to the directory from which you run the commands (i.e., the parent directory of data). If you have a different setup, you need to adjust the paths accordingly.

Hint

For large-scale crawling, you can also set up our Open Web Index Crawler (OWLER for short) or explore one of our other options below.

Step 2: Preprocessing#

Given the results of your crawler, you can run the preprocessing pipeline:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/preprocessing-pipeline \
    /data/crawl.warc.gz \
    /data/metadata.parquet.gz
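
If the container finishes without errors, the metadata file should now sit next to the WARC file:

# Quick sanity check of the preprocessing output
ls -lh data/crawl.warc.gz data/metadata.parquet.gz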

Hint

For more details and the source code, take a look at our preprocessing pipeline.

Step 3: Indexing#

And finally, you can index the preprocessed metadata and get an index for your crawl:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    /data/metadata.parquet.gz \
    /data/index/

You should now have the following files:

data/crawl.warc.gz
data/metadata.parquet.gz
data/index/index.ciff.gz

Hint

For more details on indexing, please take a look at our spark indexer, especially when experiencing out-of-heap-memory errors.

Step 4: Consuming the index#

You can consume the index using our prototype search application or other CIFF-compatible search engine libraries / frameworks like PyTerrier PISA.

In short, you have to convert the CIFF file to a Lucene index and then run the application using the converted index. The metadata is consumed from the Parquet file, and both have to have the same name.

First, unzip the created CIFF file, convert it to a Lucene index, and copy the metadata Parquet file next to it:

gunzip data/index/index.ciff.gz
mkdir -p data/serve/lucene
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/prototype-search-application/ciff-lucene-converter \
    /data/index/index.ciff \
    /data/serve/lucene/index
mkdir -p data/serve/metadata
cp data/metadata.parquet.gz data/serve/metadata/index.parquet.gz

This will create the Lucene index under data/serve/lucene/index and place the associated metadata under data/serve/metadata/. Then, run the search application to serve the index:

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -p 8000:8000 \
    opencode.it4i.eu:5050/openwebsearcheu-public/prototype-search-application/search-service \
    --default-index index \
    --port 8000 \
    --lucene-dir-path /data/serve/lucene/ \
    --parquet-dir-path /data/serve/metadata/

The application should now be running on localhost:8000 and you should be able to perform a search query (e.g., http://localhost:8000/search?q=graz; a curl example follows the parameter list below). The parameters are:

  • --default-index: name of the Lucene index (i.e., the directory name)

  • --port: port of the application inside the Docker container (if changed, it may be necessary to also change the port mapping -p 8000:8000)

  • --lucene-dir-path: path of the directory that contains the Lucene index(es)

  • --parquet-dir-path: path of the directory that contains the Parquet file(s)
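
To test the service from the command line, you can send the example query with curl (assuming the default port mapping from above):

# Query the prototype search application from the host
curl 'http://localhost:8000/search?q=graz'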

Note

The name of the Lucene index (i.e., the directory name) and the name of the Parquet file must match: for an index named {index-name}, the Parquet file must be named {index-name}.parquet.gz or {index-name}.parquet (e.g., if the Lucene index is named wiki, the associated Parquet file must be either wiki.parquet.gz or wiki.parquet).
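
Based on the commands above (with the default index name index), the layout under data/serve/ should look like this:

data/serve/
├── lucene/
│   └── index/              # Lucene index directory; the directory name is the index name
└── metadata/
    └── index.parquet.gz    # Parquet file; must be named after the index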

Options: Other data sources for indexing#

Wikipedia: Simple Wikipedia Abstracts#

Another interesting starting point is to index the Simple Wikipedia from their dumps.

Simple Wikipedia Abstracts#

The Simple Wikipedia abstracts are quite small and quick to process.

So instead of crawling your own pages in Step 1, use the wiki-to-ows tool to create the WARC file as follows:

docker run \
    --rm \
    -v "$PWD/data":/data \
    opencode.it4i.eu:5050/openwebsearcheu-public/wiki-to-ows \
    --download=https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-abstract.xml.gz \
    --warc \
    --compress \
    -o /data/simple_wiki_abstracts

Note that indexing might require more heap memory (see below).
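
From here you can continue with Step 2 and Step 3 as before, just with the new file names. A sketch of the preprocessing call, assuming the tool writes the compressed WARC to data/simple_wiki_abstracts.warc.gz (the exact output name may differ on your system):

# Preprocess the Wikipedia abstracts WARC (input/output names are assumptions)
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/preprocessing-pipeline \
    /data/simple_wiki_abstracts.warc.gz \
    /data/simple_wiki_abstracts.parquet.gz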

Options for Indexing#

Setting more Heap Memory and other Spark Properties#

For larger crawls the indexing might run out of heap memory. You can either create smaller chunks or increase the heap size. You can override any Spark properties by mounting a properties file as spark-defaults.conf, as follows:

# Write the Spark properties to a file that is mounted as spark-defaults.conf
printf 'spark.driver.memory=8g\nspark.executor.memory=8g\n' > "$PWD/spark-properties.conf"

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -v "$PWD/spark-properties.conf":/opt/spark/conf/spark-defaults.conf \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    /data/metadata.parquet.gz \
    /data/index/

You can also override the entrypoint:

docker run \
    --rm \
    -v "$PWD/tmp":/data:Z -w /opt/spark/work-dir --entrypoint /bin/bash  \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    -c "/opt/spark/bin/spark-submit --driver-memory 8g --executor-memory 8g Indexer-assembly-1.0.jar index --description 'CIFF description'  --input-format parquet --output-format ciff --id-col record_id --content-col plain_text  /data/metadata.parquet.gz /data/index"

Alternatively, you can increase the number of partitions used when indexing via the --num-partitions option.

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/spark-indexer \
    --description "CIFF description" \
    --input-format parquet \
    --output-format ciff \
    --id-col record_id \
    --content-col plain_text \
    --num-partitions 2000 \
    /data/metadata.parquet.gz \
    /data/index/