OWI Access via owilix

TL;DR

  • owilix is a command-line interface that provides git-like access to the Open Web Index for file-based syncing of remote datasets with local ones

  • owilix lets you list, pull, push, slice and query index shards

  • owilix builds on top of py4lexis, but adds structured dataset management and parallel access to multiple repositories

  • owilix selects datasets via an identifier specifying data center, date range and metadata, e.g. lrz:latest/access=public or all:2024-01-08#7/collectionName=legal

To improve accessibility to the Open Web Index, we have developed the Open Web Index Command Line Tool, also known as owilix. owilix is a command-line interface tool designed to manage data slices for the Open Web Index (OpenWebIndex.eu). It offers pull as well as push functionality and integrates SQL querying capabilities via DuckDB and the Parquet file format. While pull allows for downloading the OWI shards, push is designed to accept community contributions, for example dense index files or additional metadata.

This tool is essential for researchers, developers, and data scientists working with large-scale web datasets, providing efficient means to version, retrieve, manage, and analyze data from the Open Web Index ecosystem. owilix stands out in its ability to handle both local and remote datasets, offering a seamless experience in data management across different environments. Its primary purpose is to facilitate operations such as pulling data, listing files with specific criteria, and executing complex queries on datasets, making it an indispensable asset for anyone working with web-scale data in the context of open web search initiatives.

Example Usage

Example of owilix as a git-like tool for index data:

Architecture

The architecture of owilix is designed to integrate smoothly with the broader Open Web Index ecosystem while providing a standalone utility for data management and querying. Figure 5 shows the core elements of owilix. The tool is written in Python and built on py4lexis, the Python interface for the LEXIS Platform. It interacts with the LEXIS Platform for login (via B2ACCESS / EUDAT services) and for getting dataset metadata. Pulling and pushing datasets happens directly via iRODS, which offers efficient parallel download capabilities. On top of iRODS, we use DuckDB as the SQL query engine, which allows efficient, remote and parallel querying of index partitions stored in iRODS.

A key design goal of owilix is to abstract from individual repositories, even beyond LEXIS.

So while LEXIS and iRODS are our main focus for the Open Web Index, we can also integrate further datasets on S3 or other remote storage services, which gives owilix considerable flexibility in managing datasets.

Here’s an overview of its key architectural components and relationships:

  1. Core CLI Interface: The heart of owilix is its command-line interface, which interprets user commands and orchestrates actions across other components. This interface is built using Python, ensuring cross-platform compatibility and ease of extension.

  2. Local Data Manager: This component handles operations on local datasets, including insertion, listing, and statistical analysis. It manages the local storage of datasets, typically in a designated directory (defaulting to $HOME/.owi).

  3. Remote Data Interface: Responsible for interactions with remote data centers, this component facilitates operations like pulling datasets and comparing versions across different remote locations.

  4. Data Versioning System: Drawing parallels to Git, owilix implements a versioning system for datasets. This allows users to push local changes to remote repositories and pull updates from remote sources, maintaining a history of dataset versions.

  5. Query Engine Integration: The owilix CLI seamlessly integrates with DuckDB, a powerful embedded analytical database system. This integration enables complex querying capabilities directly on Parquet files, both locally and remotely.

  6. Authentication Module: Ensures secure access to remote resources by managing user authentication, typically through token-based systems.

  7. Plugin System (under construction): Allows for the extension of owilix’s capabilities through additional modules, such as specific repository types or custom query engines.

Common Use Cases and Example Commands

For the latest documentation and details, please refer to the Owilix Readme.

The OWI CLI tool is versatile and can be used in various scenarios, reflecting the needs of different users within the OpenWebSearch.eu ecosystem:

  1. Managing Datasets: The CLI allows users to manage their datasets effectively and is best understood as file-based synchronisation between remote and local index files. This includes inserting new datasets, listing available datasets, and deleting outdated records. Users can also compute statistics on their datasets, providing insights into the structure and content of the data they are working with.

  2. Querying Data: Users can retrieve specific datasets or records from the Open Web Index by specifying criteria such as date ranges, data centers, or metadata tags.

  3. Troubleshooting and Debugging: When issues arise within the aggregation pipeline, the CLI can be used to diagnose and resolve them. Users can query logs, check the status of the pipeline, or inspect specific data flows to identify and address any issues.
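The idea of file-based synchronisation can be pictured with a small sketch. It is not owilix's actual implementation: the function name `plan_pull`, the `PullPlan` type and the use of checksums to detect changes are all assumptions made for illustration. Given a listing of remote files and a listing of local files, it decides which files a pull would need to fetch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PullPlan:
    """Hypothetical result type: what a pull would have to do per file."""
    download: List[str]   # present remotely, missing locally
    update: List[str]     # present on both sides, but checksums differ
    unchanged: List[str]  # present on both sides with equal checksums


def plan_pull(remote: Dict[str, str], local: Dict[str, str]) -> PullPlan:
    """Compare two {relative_path: checksum} listings.

    Assumption: both sides can provide a checksum per file. Files that
    exist only locally are ignored, since a pull never deletes them here.
    """
    download = sorted(p for p in remote if p not in local)
    update = sorted(p for p in remote if p in local and local[p] != remote[p])
    unchanged = sorted(p for p in remote if p in local and local[p] == remote[p])
    return PullPlan(download=download, update=update, unchanged=unchanged)


if __name__ == "__main__":
    # Made-up file names and checksums, purely for demonstration.
    remote_files = {"part-0.parquet": "aa", "part-1.parquet": "bb", "part-2.parquet": "cc"}
    local_files = {"part-0.parquet": "aa", "part-1.parquet": "xx"}
    print(plan_pull(remote_files, local_files))
```

A push could be planned the same way with the roles of the two listings swapped.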

Practical examples of how the OWI CLI commands are used:

  • Listing Local Datasets: A user can run the command owilix local ls all:latest to list all the datasets available locally that were indexed on the latest date. The output would include a summary of the datasets, including metadata such as the date of indexing, the data center source, and the dataset size.

  • Pulling Data from Remote Servers: To pull data from a remote data center, a user could issue the command owilix remote pull lrz:latest/access=public. This would download the latest public datasets from the LRZ data center, storing them locally for further analysis.

  • Advanced Querying: A more complex command might involve querying specific data records, such as owilix query less --remote all:2023-12-4 limit=500 "where=url_suffix='at'". This command would retrieve records from December 4, 2023, filtering for URLs with suffix “at”, limiting the results to 500 entries.
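To make the query example more concrete, the following sketch shows one way that `key=value` arguments such as `limit=500` and `where=url_suffix='at'` could be turned into an SQL statement over Parquet files. How owilix actually translates its arguments is not documented here, so the function `build_query` and the file pattern are hypothetical. `read_parquet` is DuckDB's table function for reading Parquet files.

```python
from typing import Sequence


def build_query(source: str, args: Sequence[str]) -> str:
    """Build a SELECT statement from CLI-style key=value arguments.

    Illustrative only: 'where' is passed through verbatim, so this sketch
    must not be used with untrusted input.
    """
    options = {}
    for arg in args:
        key, sep, value = arg.partition("=")  # split at the first '=' only
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        options[key] = value

    # Escape single quotes so the path stays a valid SQL string literal.
    quoted_source = source.replace("'", "''")
    sql = f"SELECT * FROM read_parquet('{quoted_source}')"
    if "where" in options:
        sql += f" WHERE {options['where']}"
    if "limit" in options:
        sql += f" LIMIT {int(options['limit'])}"  # int() rejects non-numeric limits
    return sql


if __name__ == "__main__":
    # 'shards/*.parquet' is a placeholder path, not an actual OWI location.
    print(build_query("shards/*.parquet", ["limit=500", "where=url_suffix='at'"]))
```

The resulting string could then be passed to a DuckDB connection, for example via `duckdb.sql(...)` in the DuckDB Python package.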

OWILIX Dataset Selectors

owilix lets you specify sets of datasets via a selector of the form {datacenter|all}:{YYYY-MM-DD|latest}#{days}/{key=value;key=value}.

For a single dataset, you can also obtain the identifier plus the command for pulling the dataset from the dashboard by clicking on <> in the dashboard dataset card.
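As an illustration of this selector form, here is a minimal parsing sketch. It is not the parser used by owilix: the names `Selector` and `parse_selector` are hypothetical, and the `#{days}` and `/{key=value;...}` parts are assumed to be optional because the examples in this document omit them. The sketch only splits the selector into its parts and does not interpret how the day count relates to the date.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed grammar: {datacenter|all}:{YYYY-MM-DD|latest}[#{days}][/{key=value;key=value}]
_SELECTOR = re.compile(
    r"^(?P<datacenter>[^:/#]+)"
    r":(?P<date>latest|\d{4}-\d{1,2}-\d{1,2})"
    r"(?:#(?P<days>\d+))?"
    r"(?:/(?P<filters>.*))?$"
)


@dataclass
class Selector:
    datacenter: str              # e.g. "lrz" or "all"
    date: str                    # "latest" or a date string, kept as written
    days: Optional[int] = None   # value after '#', if present
    filters: Dict[str, str] = field(default_factory=dict)


def parse_selector(text: str) -> Selector:
    match = _SELECTOR.match(text)
    if match is None:
        raise ValueError(f"not a valid selector: {text!r}")
    filters: Dict[str, str] = {}
    # Metadata filters are separated by ';' and each has the form key=value.
    for pair in filter(None, (match["filters"] or "").split(";")):
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        filters[key] = value
    days = int(match["days"]) if match["days"] else None
    return Selector(match["datacenter"], match["date"], days, filters)


if __name__ == "__main__":
    print(parse_selector("lrz:latest/access=public"))
    print(parse_selector("all:2024-01-08#7/collectionName=legal"))
```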

Experimental Features and Future Extensions

  • Database queries

  • Dataset push

  • Full-text search

Source Code and Documentation

More information can be found under the following links: