Starting Point: Working with Open Web Index Data using owilix#
This tutorial introduces the owilix command-line tool for accessing, downloading, and querying datasets from the Open Web Index.
1. Getting Started#
Depending on your operating system, you can install owilix using either Docker or conda.
Note
owilix requires Python 3.11 and is only tested on Linux and macOS.
Using Docker#
Pull the image:
docker pull opencode.it4i.eu:5050/openwebsearcheu-public/owi-cli:latest
# tag the image with a short name so that the commands below work
docker tag opencode.it4i.eu:5050/openwebsearcheu-public/owi-cli:latest owilix
# check that everything is working
docker run -it owilix --help
Run commands:
# run an owilix command using the current configuration
docker run --rm -it -v ~/.owi/:/home/owi/.owi owilix remote ls
Set an alias:
# Add the alias to your shell configuration file (e.g. ~/.bashrc)
alias owilixd='docker run --rm -it -v ~/.owi/:/home/owi/.owi owilix'
Using Python#
Create a New Environment#
We recommend using a dedicated environment:
conda create -n owi pip python=3.11
conda activate owi
Install Required Packages#
Install the dependencies:
pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
pip install owilix --index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
2. Pulling and Inspecting Datasets#
List Available Datasets and check that everything works
owilix remote ls
Download a Small Dataset (only pages with impressum, imprint, contact, privacy policy, or terms of use in their URL)
owilix remote pull all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905
Note
pull synchronizes local and remote datasets on a per-file basis, allowing downloads to be resumed.
Results are stored in the local directory: ~/.owi/public/
View Some Entries through the CLI or as JSON
owilix query less --local all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905
Note
datasets can be specified with the following syntax:
datacenter:startdate#days/filter=a;filter=b
e.g. owilix local all:latest#10/collectionName=main
owilix query less --local all/id=e3fb8860-8e52-11f0-9ae7-c687956b5905 as_json=True
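The specifier grammar from the note above can be sketched as a small parser. This is an illustrative reimplementation, not part of owilix; only the shape of `datacenter:startdate#days/filter=a;filter=b` is modelled:

```python
import re

# Sketch of the owilix dataset specifier grammar:
#   datacenter:startdate#days/filter=a;filter=b
# owilix itself does the real parsing; this only illustrates the syntax.
SPEC = re.compile(
    r"^(?P<datacenter>[^:/#]+)"
    r"(?::(?P<startdate>[^#/]+))?"
    r"(?:#(?P<days>\d+))?"
    r"(?:/(?P<filters>.+))?$"
)

def parse_spec(spec: str) -> dict:
    m = SPEC.match(spec)
    if m is None:
        raise ValueError(f"not a valid dataset specifier: {spec!r}")
    d = m.groupdict()
    # filters are semicolon-separated key=value pairs
    d["filters"] = dict(
        f.split("=", 1) for f in (d["filters"] or "").split(";") if f
    )
    return d

print(parse_spec("all:latest#10/collectionName=main"))
```

Applied to the identifiers used in this tutorial, `all/id=...` parses to datacenter `all` with a single `id` filter.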
Note
owilix help gives you a list of commands and parameters:
owilix --help
owilix remote --help
owilix remote help pull (for command groups, help is considered a subcommand)
3. Using OWI Data in a small search engine (requires Docker)#
We provide a small Lucene based search engine that can be used to query the OWI data.
We will show here how this can be done. For more details, see the MOSAIC tutorial.
Download Data#
We select a dataset and inspect its details (which takes a bit longer). The id is 7bfda93e-5f84-11f0-9374-528c047b29ff; we list its files and group them. Make sure to take a dataset from the main collection, as other collections do not have index files yet.
owilix remote ls all/id=7bfda93e-5f84-11f0-9374-528c047b29ff details=True "files=**/metadata*parquet" groups=4
Next, download the data. To keep the download small, we only fetch the English files via a glob pattern. For better utilisation, we increase the number of parallel threads.
owilix remote pull all/id=7bfda93e-5f84-11f0-9374-528c047b29ff files="**/language=eng/*" num_threads=10
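To illustrate which files a pattern like `**/language=eng/*` selects, here is a stdlib sketch using `fnmatch` (file names invented; `fnmatch`'s `*` also crosses `/` boundaries, so this only approximates real glob semantics):

```python
from fnmatch import fnmatch

# Hypothetical remote file listing, partitioned by language.
files = [
    "crawl/language=eng/part-00000.parquet",
    "crawl/language=deu/part-00000.parquet",
    "crawl/language=eng/part-00001.parquet",
]

# Keep only files under a language=eng partition, as the pull above does.
selected = [f for f in files if fnmatch(f, "**/language=eng/*")]
print(selected)
```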
(Optional) A more sophisticated data selection: Slice the dataset into a new dataset#
Alternatively you can also slice the data further down by creating a new dataset:
owilix query slice --local all/id=7bfda93e-5f84-11f0-9374-528c047b29ff "where=WHERE url_suffix='at'" collection_name="mycollection" creator="me"
This creates a dataset with a new id. If you did not note down the id, you can find it with ls:
owilix query ls --local all/collectionName=mycollection
Note that you must use this ID in the next steps.
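The where= clause passed to slice above is SQL-like. As an illustration, the same filter expressed in plain SQL with Python's built-in sqlite3 (table name and rows invented):

```python
import sqlite3

# Toy table standing in for the dataset's records.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (url TEXT, url_suffix TEXT)")
con.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("https://example.at/a", "at"), ("https://example.de/b", "de")],
)

# Same predicate as the slice command: WHERE url_suffix='at'
rows = con.execute("SELECT url FROM records WHERE url_suffix='at'").fetchall()
print(rows)  # [('https://example.at/a',)]
```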
Preparing Data for MOSAIC#
Now change to a directory where you want to store the index data. The MOSAIC framework requires the data to be a single CIFF file plus multiple Parquet files in a specific folder structure.
cd ~/tmp/
mkdir data
owilix local export all/id=7bfda93e-5f84-11f0-9374-528c047b29ff outdir=$PWD/data
We can now import the data into the MOSAIC framework.
mkdir -p data/serve/lucene
docker run \
--rm \
-v "$PWD/data":/data:Z \
opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/lucene-ciff \
/data/index.ciff.gz \
/data/serve/lucene/demo-index
mkdir -p data/serve/metadata/demo-index
mv data/*metadata_* data/serve/metadata/demo-index
Now start the MOSAIC framework.
docker run \
--rm \
-v "$PWD/data":/data:Z \
-p 8008:8008 \
opencode.it4i.eu:5050/openwebsearcheu-public/mosaic/search-service \
--lucene-dir-path /data/serve/lucene/ \
--parquet-dir-path /data/serve/metadata/
You can now access the search engine at http://localhost:8008/search?q=test.
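From code, the query URL can be built with the standard library; the actual request is left commented out, since it requires the container above to be running:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Build the search URL for the locally running MOSAIC service.
base = "http://localhost:8008/search"
url = f"{base}?{urlencode({'q': 'test'})}"
print(url)  # http://localhost:8008/search?q=test

# Uncomment once the search-service container is up:
# with urlopen(url) as resp:
#     print(resp.read()[:200])
```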
4. Working with Larger Datasets#
Pull Data Filtered by Language (German only)#
Dataset downloads can be filtered at the file level. Since files are partitioned by language, this lets you filter by language.
owilix remote pull all/internalID=e3cbb70c-8e52-11f0-b931-c687956b5905 "files=**/language=deu/*"
Write Results to JSON File#
For structured output, you can write the results to a JSON file.
owilix query less --local all/internalID=e3fb8860-8e52-11f0-9ae7-c687956b5905 \
as_json=True json_file=/Users/username/tmp/myjson.json
Filter by Top-Level Domain (.at) and Select Specific Fields#
owilix query less --local all/internalID=e3fb8860-8e52-11f0-9ae7-c687956b5905 \
"select=title,url" "where=url_suffix='at'"
5. Advanced Queries#
Collect All Sites with Outgoing Links and Microdata#
owilix query less --local all/internalID=e3cbb70c-8e52-11f0-b931-c687956b5905 \
"select=url,title,outgoing_links,microdata,curlielabels_en" \
"where=microdata is not NULL"
6. Data and Schema#
Datasets are stored under:
~/.owi/public/
You can work directly with Parquet data (metadata + plain text).
CIFF data can be added to tools such as:
Lucene
Pisa
Prototype integration with MOSAIC RAG is under development: Mosaic RAG How-To
Schema:
Fixed columns#
| Column | Description | Pyspark Datatype |
|---|---|---|
| id | Unique ID based on the SHA256-hash of the URL | |
| record_id | UUID of the WARC record | |
| title | Title of the document | |
| description | Description from the document metadata | |
| keywords | Keywords from the document metadata | |
| author | Author from the document metadata | |
| main_content | Main content of the HTML, formatted with minimal HTML tags ( | |
| json-ld | String list of JSON-LD (https://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents) | |
| microdata | String list of HTML Microdata (http://www.w3.org/TR/microdata/#json) | |
| opengraph | String list of Open Graph data (https://ogp.me/) | |
| warc_date | Date from the WARC header | |
| warc_ip | IP Address from the WARC header | |
| url | Full URL | |
| url_scheme | URL scheme specifier | |
| url_path | Hierarchical path after TLD | |
| url_params | Parameters for last path element | |
| url_query | Query component | |
| url_fragment | Fragment identifier | |
| url_subdomain | Subdomain of the network location | |
| url_domain | Domain of the network location | |
| url_suffix | Suffix according to the Public Suffix List | |
| url_is_private | If the URL has a private suffix | |
| mime_type | MIME-Type from the HTTP Header | |
| charset | charset from the HTTP Header | |
| content_type_other | List of key, value pairs from the content type that could not be parsed into MIME-type or charset | |
| http_server | Server from the HTTP Header | |
| language | Language as identified by language.py; Code according to ISO-639 Part 3 | |
| valid | | |
| crawling_error | Error message set by the crawler. Only set for records with | |
| warc_file | Name of the original WARC-file that contained the record | |
| warc_offset | Offset of the record in | |
| schema_metadata | List of key, value pairs that contain global settings like the | |
|
Columns from modules#
| Column | Description | Pyspark Datatype |
|---|---|---|
| ows_canonical | The canonical link if it exists | |
| ows_fetch_response_time | Fetch time in ms | |
| ows_fetch_num_errors | Number of errors while fetching (Timeout is the most prominent fetch error) | |
| ows_genai | | |
| ows_genai_details | If | |
| ows_index | | |
| ows_referer | The URL of the page that referred to the current one | |
| ows_resource_type | Crawl from which the WARC-file originated; Files crawled by the University of Passau are labeled with “Owler” | |
| ows_tags | List of tags assigned by the OWS crawler | |
| outgoing_links | List of all hyperlinks in the HTML that start with ‘http’ | |
| image_links | List of all links to images in the HTML that start with ‘http’ | |
| video_links | List of all links to videos in the HTML that start with ‘http’ or iframes with a video | |
| iframes | List of tuples for nodes that contain an iframe (and are not a video) | |
| curlielabels | List of language specific domain labels according to Curlie.org. | |
| curlielabels_en | List of English domain labels according to Curlie.org. Mapping by Lugeon, Sylvain; Piccardi, Tiziano. | |
| address | List of dictionaries containing extracted location and coordinates. See | |
| collection_indices | List of collection indices that a record belongs to. Are defined via | |
7. Data Selection#
You can filter datasets by:
Domain (e.g., .de)
Topic (using curlie.org hierarchy labels)
Language (filename-based)
Site list (CSV of URLs, domains, or TLDs)
Structured data / microdata
Outlinks
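As a sketch of the site-list option, here is how a CSV of domains and TLDs could be matched against record URLs using only the standard library (CSV content, column name, and URLs are invented for illustration):

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical site list: one column with domains and TLD suffixes.
site_csv = "pattern\nexample.org\n.at\n"
patterns = [row["pattern"] for row in csv.DictReader(io.StringIO(site_csv))]

def matches(url: str) -> bool:
    """True if the URL's host equals a listed domain or ends with a listed suffix."""
    host = urlparse(url).hostname or ""
    return any(
        host == p or host.endswith(p if p.startswith(".") else "." + p)
        for p in patterns
    )

urls = [
    "https://www.example.org/a",
    "https://shop.example.at/b",
    "https://example.de/c",
]
print([u for u in urls if matches(u)])
```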