# Tutorial 10. Hosting the OWI on your own S3
> **Danger**
>
> This tutorial is not yet fully working. There are still some parsing errors that could not be resolved via the command line.
## Scenario

- You want to host (parts of) the OWI index on your own S3 bucket for faster querying.
- You want to host slices of the OWI index on your own S3 bucket.
Prerequisites#
Having
owilixinstalledHaving an S3 bucket configured
## 1. Configure OWILIX to use your S3 bucket
Go to the owilix config directory:

```
cd ~/.owilix
```

Edit the `owilix.cfg` file:

```
vi ~/.owilix/owilix.cfg
```

Under `repositories.config`, add the following:

```
mys3:                             # name / abbreviation of the repo. Must not start with +
  options:                        # Note: config options follow fsspec conventions
    protocol: s3a
    key: <Your Key here>          # your key to s3 if it is not public
    secret: <Your Secret here>    # your secret to s3 if it is not public
    endpoint: <endpoint>          # your endpoint to s3
    path: owi-public/{access}     # path to the bucket. {access} is a placeholder and must be provided
    async: True
  repository: s3a
```
(Optional, but recommended): under the `selected_remote` section, you can select the remotes that should be active by default. If you do not add your remote here, you always have to specify it on the CLI with `--remotes +mys3`. The drawback of adding it here is that the more remotes you have, the longer querying can take, and you might end up with duplicate data.

Check availability:

```
>> owilix --remotes mys3 remote doctor
Found 1 configured Remote Repositories
mys3: connected user=False project=False public=False
```
Since there are no datasets yet, it is normal that user, project and public are all False; they require a certain folder structure in the bucket. The important part is that it reports `abbreviation: connected` (here: `mys3: connected`).
## 2. Pushing full datasets to your S3 bucket
To fill the bucket, run an owilix command that selects datasets and pulls them into the bucket (we use the `mys3` remote, assuming it was configured above):

```
owilix --remotes +mys3,lrz,it4i remote pull all:2025-04-15#14/collectionName=main push_to_remote=mys3 num_threads=10
```
We use the remotes `lrz` and `it4i`, which are configured by default and host the main OWI index, and add our own S3 in addition (note the leading `+`). We then pull all datasets from the main index for the 14 days before April 15, 2025 and push them to our own S3 bucket. We use 10 threads to speed up the process; expect about 30 minutes per dataset on a good internet connection.
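To make the selector syntax concrete, here is a minimal sketch of how the `index:date#days/key=value` pattern used above can be read. This is our own illustration under the assumption that `#14` means the 14 days up to the given date; `parse_selector` is a hypothetical helper, not part of owilix:

```python
from datetime import date, timedelta

def parse_selector(spec):
    """Illustrative parser for 'index:YYYY-MM-DD#days/key=value' selectors.

    Assumption (not owilix's documented semantics): the window covers the
    `days` days up to and including the given date.
    """
    index_part, _, rest = spec.partition(":")
    date_part, _, filter_part = rest.partition("/")
    day_str, _, span = date_part.partition("#")
    end = date.fromisoformat(day_str)
    days = int(span) if span else 1
    # Remaining path segments are key=value filters on the datasets.
    filters = dict(kv.split("=", 1) for kv in filter_part.split("/") if kv)
    start = end - timedelta(days=days - 1)
    return {"index": index_part, "start": start, "end": end, "filters": filters}

sel = parse_selector("all:2025-04-15#14/collectionName=main")
# sel covers 2025-04-02 through 2025-04-15 of the "all" index,
# restricted to collectionName=main.
```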
> **Note**
>
> As of now, long-running jobs might fail due to a bug in the platform. However, syncing can continue from unsynced files by simply re-running the command.
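The re-run workflow from the note can be automated with a small retry wrapper. This is our own sketch, assuming owilix exits non-zero on failure; `run_with_retries` is a hypothetical helper, not an owilix feature:

```python
import subprocess
import time

def run_with_retries(cmd, max_attempts=5, wait_seconds=30):
    """Re-run a command until it exits 0; returns the number of attempts used.

    Useful because syncing resumes from unsynced files on each re-run.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(wait_seconds)  # back off before retrying
    raise RuntimeError(f"command failed after {max_attempts} attempts")

# Example (hypothetical): wrap the pull command from above
# run_with_retries(["owilix", "--remotes", "+mys3,lrz,it4i", "remote", "pull",
#                   "all:2025-04-15#14/collectionName=main",
#                   "push_to_remote=mys3", "num_threads=10"])
```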
## 3. Working with datasets in your S3 bucket
By specifying `--remotes mys3`, all owilix commands run against your S3 bucket. Some examples:
### Querying your S3 datasets
List all datasets:

```
owilix --remotes mys3 remote ls
```
### Slicing and creating new datasets
Slicing refers to the process of creating new datasets by materializing (several) queries over existing datasets.
Step 1: Create slices on your local machine from existing datasets on your S3:

```
owilix query --remotes mys3 slice --remote all:latest/collectionName=main creator="Max Mustermann, MyS3Orga" "where=url_suffix='at'" prefetch=200 collection_name="all_of_austria"
```

By only selecting the remote `mys3` with `--remotes`, we limit the query, and thus the dataset creation, to the datasets stored there. Note that `--remote` (without the s) is part of the query command, which allows you to select remote and local datasets to be queried / sliced.

Step 1a (optional): Inspect the local slice.
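The `where=url_suffix='at'` filter in Step 1 restricts the slice to `.at` URLs. Conceptually, it works like the following sketch (our own illustration; here we assume `url_suffix` means the last label of the hostname, which may differ from owilix's exact semantics):

```python
from urllib.parse import urlparse

def url_suffix(url):
    # Last dot-separated label of the hostname, e.g. "at" for *.at domains.
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]

urls = ["https://orf.at/news", "https://example.de/", "https://www.gv.at/"]
austria_only = [u for u in urls if url_suffix(u) == "at"]
# keeps only the two .at URLs
```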
Step 2: Push the slice to the remote (TODO)
Step X: Delete the slices. If you are not happy with the slices, you can easily delete the slices / all of your collection.

Locally:

```
owilix local remove all/collectionName=all_of_austria
```

Remotely:

```
owilix --remotes mys3 remote remove all/collectionName=all_of_austria
```
> **Note**
>
> Currently, slicing requires creating the new dataset on a local machine and then uploading it again. This might change in the future.