Tutorial 10. Hosting the OWI on your own S3#

Danger

This tutorial is not fully working yet. There are still some parsing errors that could not be resolved via the command line.

Scenario#

  • You want to host (parts of) the OWI index on your own S3 bucket for faster querying

  • You want to host slices of the OWI index on your own S3 bucket.

Prerequisites#

  • Having owilix installed

  • Having an S3 bucket configured

1. Configure OWILIX to use your S3 bucket#

  1. Go to the owilix config directory: cd ~/.owilix

  2. Edit the “owilix.cfg” file: vi ~/.owilix/owilix.cfg

  3. Under repositories.config, add the following:

     mys3: # name / abbreviation of the repo. Must not start with +
       options: # Note: config options follow fsspec conventions
         protocol: s3a
         key: <Your Key here> # your key to s3 if it is not public
         secret: <Your Secret here> # your secret to s3 if it is not public
         endpoint: <endpoint> # your endpoint to s3
         path: owi-public/{access} # path to the bucket. {access} is a placeholder and must be provided
         async: True
       repository: s3a
    
  4. (Optional, but recommended): under the selected_remote section, you can select the remotes that should be active by default. If you do not add your remote here, you always have to specify it on the CLI with --remotes +mys3. The drawback of adding it here is that the more remotes you have, the longer querying can take, and you might end up with duplicate data.

  5. Check availability:

     >> owilix --remotes mys3 remote doctor
     Found 1 configured Remote Repositories
     mys3: connected
             user=False      project=False   public=False
    

    Since there are no datasets yet, it is normal that user, project and public are all false; populating them requires a certain folder structure in the bucket. What is important is that the output reports abbreviation: connected
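As the config above notes, the options under a repository entry follow fsspec conventions: the protocol selects an fsspec filesystem backend, and the remaining keys become keyword arguments to it. The following is a minimal sketch of that convention only (how owilix itself forwards these options is our assumption); it uses fsspec's built-in memory backend so it runs without S3 credentials. For a real bucket, the protocol would be s3/s3a together with key, secret and endpoint.

```python
import fsspec

# Repository options in owilix.cfg follow fsspec conventions: "protocol"
# picks the backend, the remaining keys are passed as keyword arguments.
# We use the built-in "memory" backend here so no S3 credentials are needed.
options = {"protocol": "memory"}

protocol = options.pop("protocol")
fs = fsspec.filesystem(protocol, **options)

# Write and read back a small object under a bucket-like prefix.
fs.pipe("/owi-public/public/demo.txt", b"hello owi")
print(fs.cat("/owi-public/public/demo.txt"))  # b'hello owi'
```

With an S3-style backend, the same call would look like `fsspec.filesystem("s3", key=..., secret=..., client_kwargs={"endpoint_url": ...})`, which is why the key/secret/endpoint names in the config map directly onto fsspec storage options.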

2. Pushing full datasets to your S3 bucket#

To fill the bucket, run an owilix command to select datasets and pull them into the bucket (we use the mys3 remote, assuming it was configured above):

owilix --remotes +mys3,lrz,it4i remote pull all:2025-04-15#14/collectionName=main push_to_remote=mys3 num_threads=10

We use the remotes lrz and it4i, which are configured by default and host the main OWI index, and add our own S3 in addition (the little +). We then pull all datasets from the main index for the 14 days before 15 April 2025 and push them to our own S3 bucket. We use 10 threads to speed up the process. It should take about 30 minutes per dataset on a good internet connection.

Note

As of now, long-running jobs might fail due to a bug in the platform. However, syncing can continue from unsynced files by simply re-running the command.

3. Working with datasets in your S3 bucket#

By specifying --remotes mys3, all owilix commands run on your S3 bucket. Some examples:

Querying your S3 datasets#

List all datasets:

owilix --remotes mys3 remote ls

Slicing and creating new datasets:#

Slicing refers to the process of creating new datasets by materializing (several) queries over existing datasets.
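As an illustration only (this is not owilix code): materializing a query means persisting its result set as a new, independent dataset, which can then be queried on its own. The toy sketch below expresses this in SQL terms, mirroring the where=url_suffix='at' filter used in the command that follows; the table and column names are hypothetical.

```python
import sqlite3

# Toy illustration of "materializing a query over an existing dataset":
# the slice (all_of_austria) is a new table built from a WHERE clause,
# analogous to where=url_suffix='at' in the owilix slice command.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE main_index (url TEXT, url_suffix TEXT)")
con.executemany(
    "INSERT INTO main_index VALUES (?, ?)",
    [("https://example.at", "at"), ("https://example.de", "de")],
)

# Materialize the filtered result as its own table (the "slice").
con.execute(
    "CREATE TABLE all_of_austria AS "
    "SELECT * FROM main_index WHERE url_suffix = 'at'"
)

rows = con.execute("SELECT url FROM all_of_austria").fetchall()
print(rows)  # [('https://example.at',)]
```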

  • Step 1: Creating slices on your local machine from existing datasets on your s3

     owilix --remotes mys3 query slice --remote all:latest/collectionName=main creator="Max Mustermann, MyS3Orga" "where=url_suffix='at'" prefetch=200 collection_name="all_of_austria"

    By only selecting the remote mys3 with --remotes, we limit the query, and thus dataset creation, to datasets stored there. Note that --remote (without the s) is part of the query command, which allows you to select remote and local datasets to be queried / sliced.

  • Step 1a (optional): Inspect the local slice

  • Step 2: Push the slice to the remote (TODO)

  • Step X: Delete the slices. If you are not happy with the slices, you can delete them / your whole collection easily with:

    • locally: owilix local remove all/collectionName=all_of_austria

    • remotely: owilix --remotes mys3 remote remove all/collectionName=all_of_austria

Note

Currently, slicing requires creating the new dataset on a local machine and then uploading it again. This might change in the future.