Tutorials
This chapter presents a set of tutorials to get you started with our technology and serves as a quick start to the project.
Here’s an overview of all available tutorials:
Start: Start working with OpenWebSearch.eu Data - Introduces the owilix command-line tool for accessing, downloading, and querying datasets from the Open Web Index.
Tutorial 0: How to create an index locally - Shows how to set up the indexing pipeline locally using Docker, from crawling websites to creating searchable indexes.
Tutorial 1: Data Download with owilix - Demonstrates how to use the owilix command-line tool to access, download, and query Open Web Index datasets.
Tutorial 2: How to use MOSAIC - Explains how to set up and run the MOSAIC search engine for searching through index data with a REST API and web interface.
Tutorial 3: Slicing Data with owilix - Covers how to create custom data subsets by running SQL queries against parquet files using owilix.
Tutorial 4: How to Analyse OWI Data - Shows how to pull OWI data and analyze it using Python, pandas, and Jupyter notebooks for statistical analysis.
Tutorial 5: How to download index files using the Lexis Platform - Provides guidance on downloading index files through the Lexis Platform (content coming soon).
Tutorial 6: Pushing Data to OpenSearch - Demonstrates how to use owilix to push web data into OpenSearch clusters for internal search augmentation.
Tutorial 7: Filter Sites and Push to OpenSearch - Shows how to build an OpenSearch index for specific websites using URL filtering with owilix.
Tutorial 8: How to Evaluate Components with TIRA/TIREx - Covers how to evaluate software components using the TIRA and TIREx evaluation platforms.
Tutorial 9: How to Develop New Modules for Resilipipe - Explains how to create custom processing modules for the Resilipipe WARC processing pipeline.
Tutorial 10: Hosting the OWI on your own S3 - Shows how to host Open Web Index data on your own S3 bucket for faster access and querying.
Tutorial 11: Building an OWI Lake - Shows how to integrate daily index shards into a data lake as a staging area for indexing and analytics.
Tutorial 12: Uploading your own dataset - Shows how to upload your own non-OWI dataset to share resources related to search, NLP, and AI.
Tutorial 99: Data Upload - Covers how to upload and share data using the owilix command-line tool (experimental feature).
Tutorial 14: Finetuning an LLM with OWI data using the LUMI supercomputer - Shows how to use Open Web Index data for LLM finetuning on the LUMI supercomputer.