Starting Point: Working with Open Web Index Data using owilix#

This tutorial introduces the owilix command-line tool for accessing, downloading, and querying datasets from the Open Web Index.


1. Getting Started#

Depending on your operating system, you can install owilix using either docker or conda.

Note

owilix requires Python 3.11 and is only tested under Linux and macOS.

Using Docker#

Pull the image:

docker pull opencode.it4i.eu:5050/openwebsearcheu-public/owi-cli:latest
# tag the image with a short name so the commands below can refer to it as "owilix"
docker tag opencode.it4i.eu:5050/openwebsearcheu-public/owi-cli:latest owilix
# see if everything is working
docker run -it owilix --help

Run commands:

# use an owilix command using current configuration
docker run --rm -it -v ~/.owi/:/home/owi/.owi owilix remote ls

Set an alias:

# Add the alias to your shell configuration file (e.g. ~/.bashrc)
alias owilixd='docker run --rm -it -v ~/.owi/:/home/owi/.owi owilix'

Using Python#

Create a New Environment#

We recommend using a dedicated environment:

conda create -n owi pip python=3.11
conda activate owi

Install Required Packages#

Install the dependencies:

pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install owilix   --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple

2. Pulling and Inspecting Datasets#

  1. List Available Datasets and check that everything works

owilix remote ls
  2. Download a Small Dataset (only pages with impressum, imprint, contact, privacy policy, or terms of use in their URL)

owilix remote pull all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905

Note

Note that pull synchronizes local and remote datasets on a per-file basis, allowing downloads to be resumed. Results are stored in the local directory ~/.owi/public/

  3. View Some Entries through the CLI or as JSON

owilix query less --local all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905

Note

Datasets can be specified with the following syntax: datacenter:startdate#days/filter=a;filter=b

e.g. owilix local all:latest#10/collectionName=main
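To make the specifier grammar concrete, here is a small illustrative parser. This is only a sketch for understanding the syntax, not owilix's own implementation, and the field names (`datacenter`, `startdate`, `days`, `filters`) are our own labels:

```python
# Illustrative parser for the dataset specifier syntax:
#   datacenter:startdate#days/filter=a;filter=b
# (a sketch for understanding the grammar, not owilix's implementation)

def parse_spec(spec: str) -> dict:
    rest, _, filter_part = spec.partition("/")
    # filters are ;-separated key=value pairs after the slash
    filters = dict(f.split("=", 1) for f in filter_part.split(";")) if filter_part else {}
    rest, _, days = rest.partition("#")
    datacenter, _, startdate = rest.partition(":")
    return {
        "datacenter": datacenter,
        "startdate": startdate or None,
        "days": int(days) if days else None,
        "filters": filters,
    }

print(parse_spec("all:latest#10/collectionName=main"))
# {'datacenter': 'all', 'startdate': 'latest', 'days': 10,
#  'filters': {'collectionName': 'main'}}
```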


owilix query less --local all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905 as_json=True

Note

owilix help gives you the list of commands and parameters: owilix --help, owilix remote --help, owilix remote help pull (for command groups, help is treated as a subcommand)


3. Using OWI Data in a small search engine (requires Docker)#

We provide a small Lucene based search engine that can be used to query the OWI data.

We show here how this can be done. For more details, see the MOSAIC tutorial.

Download Data#

We select a dataset and inspect its details (which takes a bit longer). The id is 7bfda93e-5f84-11f0-9374-528c047b29ff; we list the files and group them. Note that we must take a dataset from the main collection, as other collections do not have index files yet.

owilix remote ls all/id=7bfda93e-5f84-11f0-9374-528c047b29ff details=True "files=**/metadata*parquet" groups=4

Next, download the data. To keep the download small, we only fetch the English files via a glob pattern. For better utilisation, we increase the number of parallel threads.

owilix remote pull all/id=7bfda93e-5f84-11f0-9374-528c047b29ff files="**/language=eng/*" num_threads=10

(Optional) More sophisticated data selection: slice the dataset into a new dataset#

Alternatively you can also slice the data further down by creating a new dataset:

owilix query slice --local all/id=7bfda93e-5f84-11f0-9374-528c047b29ff "where=WHERE url_suffix='at'" collection_name="mycollection" creator="me"

This creates a dataset with a new id. If you did not note down the id, you can find it with ls:

owilix query ls --local all/collectionName=mycollection

Note that you must use this ID in the next steps.

Preparing Data for MOSAIC#

Now change to a directory where you want to store the index data. The MOSAIC framework requires the data to be a single CIFF file plus multiple Parquet files in a specific folder structure.

cd ~/tmp/
mkdir data
owilix local export all/id=7bfda93e-5f84-11f0-9374-528c047b29ff outdir=$PWD/data

We can now import the data into the MOSAIC framework.

mkdir -p data/serve/lucene
docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/lucene-ciff \
    /data/index.ciff.gz \
    /data/serve/lucene/demo-index
mkdir -p data/serve/metadata/demo-index
mv data/*metadata_* data/serve/metadata/demo-index

Now start the MOSAIC framework.

docker run \
    --rm \
    -v "$PWD/data":/data:Z \
    -p 8008:8008 \
    opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/search-service \
    --lucene-dir-path /data/serve/lucene/ \
    --parquet-dir-path /data/serve/metadata/

and you can now access the search engine at http://localhost:8008/search?q=test.
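The same endpoint can be queried programmatically. A minimal sketch that builds the query URL with the standard library (only the endpoint and the `q` parameter are taken from this tutorial; the response format is whatever the MOSAIC search-service returns):

```python
from urllib.parse import urlencode

# Build a query URL for the locally running MOSAIC search service.
# Endpoint and "q" parameter as used above; once the service is running,
# fetch the URL with urllib.request.urlopen().
def search_url(query: str, base: str = "http://localhost:8008/search") -> str:
    return f"{base}?{urlencode({'q': query})}"

print(search_url("open web index"))  # http://localhost:8008/search?q=open+web+index
```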


4. Working with Larger Datasets#

Pull Data Filtered by Language (German only)#

Dataset downloads can be filtered at the file level. Since files are partitioned by language, you can thus filter by language.

owilix remote pull all/internalID=e3cbb70c-8e52-11f0-b931-c687956b5905 "files=**/language=deu/*"

Write Results to JSON File#

For structured output, you can write the results to a JSON file.

owilix query less --local all/internalID=e3fb8860-8e52-11f0-9ae7-c687956b5905 \
    as_json=True json_file=/Users/username/tmp/myjson.json
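The resulting file can then be processed with standard tooling. Since the exact layout of json_file is not documented here, this sketch accepts either a single JSON array or one JSON object per line (JSON Lines):

```python
import json

def load_records(path: str) -> list:
    """Read records written by `owilix query less ... as_json=True json_file=...`.

    The exact file layout is an assumption: this accepts either a single
    JSON array or one JSON object per line (JSON Lines).
    """
    with open(path, encoding="utf-8") as fh:
        text = fh.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# records = load_records("/Users/username/tmp/myjson.json")
```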

Filter by Top-Level Domain (.at) and Select Specific Fields#

owilix query less --local all/internalID=e3fb8860-8e52-11f0-9ae7-c687956b5905 \
    "select=title,url" "where=url_suffix='at'"

5. Advanced Queries#


6. Data and Schema#

  • Datasets are stored under:

    ~/.owi/public/
    
  • You can work directly with Parquet data (metadata + plain text).

  • CIFF index data can be loaded into CIFF-compatible retrieval tools (e.g. the Lucene-based MOSAIC engine shown above).

  • Prototype integration with MOSAIC RAG is under development: Mosaic RAG How-To
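Because the data is plain Parquet, it can be opened directly. A minimal sketch using pandas (an assumption: pandas plus a Parquet engine such as pyarrow must be installed; point the path at a dataset folder created under ~/.owi/public/ by a pull):

```python
from pathlib import Path

import pandas as pd

def load_dataset(dataset_dir) -> pd.DataFrame:
    """Concatenate every Parquet file below dataset_dir into one DataFrame.

    dataset_dir is a dataset folder under ~/.owi/public/ after a pull;
    the recursive glob also picks up language=... partition folders.
    """
    files = sorted(Path(dataset_dir).rglob("*.parquet"))
    if not files:
        raise FileNotFoundError(f"no parquet files under {dataset_dir}")
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

# df = load_dataset(Path.home() / ".owi" / "public")
# print(df[["title", "url", "language"]].head())
```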

Schema:

Fixed columns#

| Column | Description | Pyspark Datatype |
| --- | --- | --- |
| id | Unique ID based on the SHA256-hash of the URL | StringType() |
| record_id | UUID of the WARC record | StringType() |
| title | Title of the document | StringType() |
| description | Description from the document metadata | StringType() |
| keywords | Keywords from the document metadata | StringType() |
| author | Author from the document metadata | StringType() |
| main_content | Main content of the HTML, formatted with minimal HTML tags (h1-6, p, ul/ol/li, pre, and a tags) | StringType() |
| json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | StringType() |
| microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | StringType() |
| opengraph | String list of Open Graph data (https://ogp.me/) | StringType() |
| warc_date | Date from the WARC header | StringType() |
| warc_ip | IP address from the WARC header | StringType() |
| url | Full URL | StringType() |
| url_scheme | URL scheme specifier | StringType() |
| url_path | Hierarchical path after TLD | StringType() |
| url_params | Parameters for last path element | StringType() |
| url_query | Query component | StringType() |
| url_fragment | Fragment identifier | StringType() |
| url_subdomain | Subdomain of the network location | StringType() |
| url_domain | Domain of the network location | StringType() |
| url_suffix | Suffix according to the Public Suffix List | StringType() |
| url_is_private | If the URL has a private suffix | BooleanType() |
| mime_type | MIME type from the HTTP header | StringType() |
| charset | Charset from the HTTP header | StringType() |
| content_type_other | List of key, value pairs from the content type that could not be parsed into MIME type or charset | MapType(StringType(), StringType()) |
| http_server | Server from the HTTP header | StringType() |
| language | Language as identified by language.py; code according to ISO-639 Part 3 | StringType() |
| valid | True: the record is valid; False: the record is not/no longer valid and should not be processed | BooleanType() |
| crawling_error | Error message set by the crawler; only set for records with valid=False | StringType() |
| warc_file | Name of the original WARC file that contained the record | StringType() |
| warc_offset | Offset of the record in warc_file in the (uncompressed) stream | IntegerType() |
| schema_metadata | List of key, value pairs that contain global settings like the schema_version | MapType(StringType(), StringType()) |
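As an illustration of how the url_* columns decompose a full URL, here is a sketch using only the standard library. Note that splitting the host into url_subdomain, url_domain, and url_suffix additionally requires the Public Suffix List (e.g. via a package such as tldextract), which is not reproduced here:

```python
from urllib.parse import urlparse

# Sketch: how the url_* schema columns relate to a full URL.
# urlparse covers scheme/path/params/query/fragment; the
# subdomain/domain/suffix split needs the Public Suffix List.
def url_columns(url: str) -> dict:
    p = urlparse(url)
    return {
        "url": url,
        "url_scheme": p.scheme,
        "url_path": p.path,
        "url_params": p.params,
        "url_query": p.query,
        "url_fragment": p.fragment,
    }

print(url_columns("https://docs.example.co.at/a/b?x=1#top"))
```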

Columns from modules#

| Column | Description | Pyspark Datatype |
| --- | --- | --- |
| ows_canonical | The canonical link if it exists | StringType() |
| ows_fetch_response_time | Fetch time in ms | IntegerType() |
| ows_fetch_num_errors | Number of errors while fetching (timeout is the most prominent fetch error) | StringType() |
| ows_genai | True: the content is allowed to be used for the purposes of developing generative AI models; False: the content cannot be used | BooleanType() |
| ows_genai_details | If ows_genai=False, this provides additional context | StringType() |
| ows_index | True: the content is allowed to be used for the purposes of web indexing/web search; False: the content cannot be used | BooleanType() |
| ows_referer | The URL of the page that referred to the current one | StringType() |
| ows_resource_type | Crawl from which the WARC file originated; files crawled by the University of Passau are labeled with "Owler" | StringType() |
| ows_tags | List of tags assigned by the OWS crawler | ArrayType(StringType()) |
| outgoing_links | List of all hyperlinks in the HTML that start with 'http' | StructType with src and anchor_text |
| image_links | List of all links to images in the HTML that start with 'http' | StructType with src, width, and height |
| video_links | List of all links to videos in the HTML that start with 'http' or iframes with a video | StructType with src, width, and height |
| iframes | List of tuples for nodes that contain an iframe (and are not a video) | StructType with src, width, and height |
| curlielabels | List of language-specific domain labels according to Curlie.org | ArrayType(StringType()) |
| curlielabels_en | List of English domain labels according to Curlie.org; mapping by Lugeon, Sylvain; Piccardi, Tiziano | ArrayType(StringType()) |
| address | List of dictionaries containing extracted location and coordinates | See get_spark_schema in geoparsing.py |
| collection_indices | List of collection indices that a record belongs to; defined via YAML files on the S3 instance | ArrayType(StringType()) |


7. Data Selection#

You can filter datasets by:

  • Domain (e.g., .de)

  • Topic (using curlie.org hierarchy labels)

  • Language (filename-based)

  • Site list (CSV of URLs, domains, or TLDs)

  • Structured data / microdata

  • Outlinks
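These criteria map onto schema columns such as url_suffix, language, url_domain, and curlielabels. As a sketch, once records are loaded they can be filtered in plain Python (the sample records and the site list are made up for illustration):

```python
# Sketch: apply the selection criteria above to loaded records, using
# the schema columns url_suffix, language, and url_domain.
# The sample records below are hypothetical.
records = [
    {"url_domain": "example", "url_suffix": "de", "language": "deu",
     "curlielabels_en": ["Science"]},
    {"url_domain": "sample", "url_suffix": "at", "language": "eng",
     "curlielabels_en": ["Sports"]},
]

sites = {"example"}  # e.g. loaded from a CSV site list

selected = [
    r for r in records
    if r["url_suffix"] == "de"      # domain/TLD filter
    and r["language"] == "deu"      # language filter
    and r["url_domain"] in sites    # site-list filter
]
print(len(selected))  # 1
```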