Tutorial 12: Uploading Your Own Datasets
OpenWebSearch.eu allows you to upload and share your datasets through the LEXIS Platform, provided by our HPC partners.
Prerequisites
Install py4lexis (pip install py4lexis --index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple)
Have an account with permission to use OpenWebSearch.eu project resources
👉 If you need access, please contact us and request access in the Lexis Portal as follows:
Log in to the Lexis Portal using the B2Access Federator. You should be able to use your academic home login, ORCID, or a social login; we encourage the academic home login.
Go to Projects and click Request Project Access. Enter “openwebsearch” in the form.
Contact us separately via e-mail or Mattermost and tell us why you need access.
Step 1 – Login
Log in with py4lexis:
python -m py4lexis.cli session login-url
Step 2 – Choose a Storage Location
Determine the available storages for the openwebsearch project:
python -m py4lexis.cli datasets get-project-storages openwebsearch
Step 3 – Prepare Metadata (DataCite JSON)
Fill out a DataCite JSON template. A minimal example looks like this:
{
"creators": [
{
"name": "THE CREATOR"
}
],
"titles": [
{
"lang": "en",
"title": "THE TITLE"
}
],
"publisher": {
"name": "THE CREATOR"
},
"types": {
"resourceType": "Dataset",
"resourceTypeGeneral": "Dataset"
},
"publicationYear": 2025
}
You can also generate a minimal JSON via Python:
from py4lexis.models.datacite_model import Datacite
print(Datacite.get_basic_data("THE CREATOR", "THE TITLE"))
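Before passing a record to the CLI, it can help to check that it contains at least the fields of the minimal example. The following is a small sketch using only the standard library. The helper name `missing_datacite_fields` is hypothetical, and the list of required keys is an assumption taken from the minimal example above, not a rule stated by py4lexis or DataCite.

```python
import json

# Top-level keys present in the minimal example above. Treating them as
# "required" is an assumption; the LEXIS Platform may accept or demand others.
REQUIRED_KEYS = ("creators", "titles", "publisher", "types", "publicationYear")


def missing_datacite_fields(record: dict) -> list:
    """Return the minimal-example keys that are absent or empty in record."""
    return [key for key in REQUIRED_KEYS if not record.get(key)]


MINIMAL_RECORD = json.loads("""
{
  "creators": [{"name": "THE CREATOR"}],
  "titles": [{"lang": "en", "title": "THE TITLE"}],
  "publisher": {"name": "THE CREATOR"},
  "types": {"resourceType": "Dataset", "resourceTypeGeneral": "Dataset"},
  "publicationYear": 2025
}
""")

# An empty list means every key from the minimal example is present.
print(missing_datacite_fields(MINIMAL_RECORD))
```

A record loaded from your own JSON file can be checked the same way before it is passed to the dataset creation command.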
Step 4 – (Optional) Provide a More Complete DataCite JSON
We recommend using a more complete DataCite JSON (this can also be edited later via the portal):
{
"creators": [
{
"name": "THE CREATOR"
}
],
"titles": [
{
"lang": "en",
"title": "THE TITLE"
}
],
"publisher": {
"name": "THE CREATOR"
},
"types": {
"resourceType": "Dataset",
"resourceTypeGeneral": "Dataset"
},
"publicationYear": 2025,
"dates": [
{
"date": "2025-06-09T00:00:00",
"dateType": "Updated"
},
{
"date": "2025-06-09T00:00:00",
"dateType": "Other",
"dateInformation": "startDate"
},
{
"date": "2025-06-09T00:00:00",
"dateType": "Other",
"dateInformation": "endDate"
}
],
"fundingReferences": [
{
"funderName": "European Commission",
"funderIdentifier": "https://doi.org/10.13039/501100000780",
"funderIdentifierType": "Crossref Funder ID",
"awardNumber": "101070014",
"awardUri": "https://cordis.europa.eu/project/id/101070014",
"awardTitle": "Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty"
}
]
}
👉 In the additional metadata (see Step 5), set objectCount to the number of rows, if available.
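The three entries of the "dates" array follow a fixed pattern, so they can be generated instead of typed by hand. This is a minimal sketch, assuming the timestamp format and the startDate/endDate convention of the example above; the function name `build_dates` is hypothetical and not part of py4lexis.

```python
from datetime import datetime


def build_dates(updated: datetime, start: datetime, end: datetime) -> list:
    """Build a "dates" array in the shape used by the example above."""
    def stamp(value: datetime) -> str:
        # Produces e.g. 2025-06-09T00:00:00, matching the example format.
        return value.isoformat(timespec="seconds")

    return [
        {"date": stamp(updated), "dateType": "Updated"},
        {"date": stamp(start), "dateType": "Other", "dateInformation": "startDate"},
        {"date": stamp(end), "dateType": "Other", "dateInformation": "endDate"},
    ]
```

The returned list can be assigned to the "dates" key of a DataCite record before it is serialized with `json.dumps`.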
Step 5 – Create the Dataset
Add additional metadata for findability:
"additionalMetadata": {
"collectionName": "special",
"totalSize": 192781845880,
"fileCount": 3361,
"objectCount": 59215037
}
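The size and file count can be derived from the local copy of the dataset rather than entered manually. The sketch below uses only the standard library. It assumes that totalSize is meant in bytes and that fileCount counts regular files in all subfolders, neither of which is stated in this tutorial; the function name `additional_metadata` is hypothetical. The row count cannot be derived without reading the files, so it is passed in.

```python
from pathlib import Path
from typing import Optional


def additional_metadata(directory: str, collection_name: str,
                        object_count: Optional[int] = None) -> dict:
    """Summarize a local folder in the shape of the additionalMetadata block."""
    files = [path for path in Path(directory).rglob("*") if path.is_file()]
    metadata = {
        "collectionName": collection_name,
        # Assumption: totalSize is the sum of file sizes in bytes.
        "totalSize": sum(path.stat().st_size for path in files),
        "fileCount": len(files),
    }
    if object_count is not None:
        # Number of rows, if known (see the note on objectCount).
        metadata["objectCount"] = object_count
    return metadata
```

For example, `additional_metadata("cc", "special", object_count=59215037)` would describe a local folder named cc.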
Command template:
python -m py4lexis.cli datasets create-dataset project openwebsearch "iRODS IT4I" "iRODS LEXIS V2" --title "THE TITLE" --datacite '{
"creators": [{"name": "THE CREATOR"}],
"titles": [{"lang": "en", "title": "THE TITLE"}],
"publisher": {"name": "THE CREATOR"},
"types": {"resourceType": "owix", "resourceTypeGeneral": "Dataset"},
"publicationYear": 2025,
"dates": [
{"date": "2025-06-09T00:00:00", "dateType": "Updated"},
{"date": "2025-06-09T00:00:00", "dateType": "Other", "dateInformation": "startDate"},
{"date": "2025-06-09T00:00:00", "dateType": "Other", "dateInformation": "endDate"}
],
"fundingReferences": [
{
"funderName": "European Commission",
"funderIdentifier": "https://doi.org/10.13039/501100000780",
"funderIdentifierType": "Crossref Funder ID",
"awardNumber": "101070014",
"awardUri": "https://cordis.europa.eu/project/id/101070014",
"awardTitle": "Piloting a Cooperative Open Web Search Infrastructure to Support Europe\u0027s Digital Sovereignty"
}
]
}' --additional-metadata '{
"collectionName": "special",
"totalSize": 192781845880,
"fileCount": 3361,
"objectCount": 59215037
}'
The apostrophe in the award title is written as \u0027 because a literal ' would end the single-quoted shell argument.
Use project for restricted datasets (project members only).
Use public to make the dataset visible to all users.
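Passing JSON through a shell requires care with quoting, since the award title contains an apostrophe. One way to sidestep this is to assemble the argument list in Python and let `json.dumps` produce the JSON strings. The sketch below mirrors the argument order of the command template above; the function name `build_create_command` and its parameter names are hypothetical, and nothing beyond the arguments shown in the template is assumed about the CLI.

```python
import json
import sys


def build_create_command(access: str, project: str, storage_args: list,
                         title: str, datacite: dict,
                         additional_metadata: dict) -> list:
    """Build the create-dataset argument list in the order of the template."""
    return [
        sys.executable, "-m", "py4lexis.cli", "datasets", "create-dataset",
        access, project, *storage_args,
        "--title", title,
        # json.dumps yields valid JSON; no shell quoting is involved because
        # the list is passed to subprocess without a shell.
        "--datacite", json.dumps(datacite),
        "--additional-metadata", json.dumps(additional_metadata),
    ]
```

The resulting list can be executed with `subprocess.run(command, check=True, capture_output=True, text=True)`, using `["iRODS IT4I", "iRODS LEXIS V2"]` as `storage_args` and either `"project"` or `"public"` as `access`.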
The command will return a unique dataset ID, e.g.:
The dataset was successfully created with dataset ID: 4322b10c-8fa6-11f0-b931-c687956b5905
You can also find the dataset ID in the Lexis Portal → Datasets → Copy ID.
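When dataset creation is scripted, the ID can be picked out of the command output for use in the upload steps. This is a small sketch that searches for a UUID-shaped token in the message shown above; the function name `extract_dataset_id` is hypothetical, and the exact wording of the CLI message may differ between py4lexis versions.

```python
import re
from typing import Optional

# Matches the 8-4-4-4-12 hexadecimal layout of the example dataset ID.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def extract_dataset_id(cli_output: str) -> Optional[str]:
    """Return the first UUID-shaped token in cli_output, or None."""
    match = UUID_PATTERN.search(cli_output)
    return match.group(0) if match else None
```

Because only the shape of the ID is matched, the helper does not depend on the surrounding sentence.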
Step 6 – Review and Edit Metadata
Go to the Lexis Portal → Uploads to review your dataset.
Metadata can be edited there (e.g., license, related publications).
Publications should be listed under relatedIdentifiers.
Step 7 – Upload Files
Example for uploading a file:
python -m py4lexis.cli datasets tus-uploader-rewrite 4322b10c-8fa6-11f0-b931-c687956b5905 openwebsearch cc/myfile.parquet "iRODS IT4I" "iRODS LEXIS V2" --file-path . --destination-path /
This works for archives and single files.
Directory upload is under development, but you can script uploads (Python or shell).
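A scripted upload could look like the following sketch, which builds one command per file in the same shape as the example above. The function names `build_upload_commands` and `upload_all` are hypothetical. It is also an assumption that files in nested subfolders can be passed with their relative path and the destination path /, which has not been verified against the uploader.

```python
import subprocess
import sys
from pathlib import Path

PROJECT = "openwebsearch"
# Storage arguments copied from the example command above.
STORAGE_ARGS = ["iRODS IT4I", "iRODS LEXIS V2"]


def build_upload_commands(dataset_id: str, folder: str, base_dir: str = ".") -> list:
    """Return one tus-uploader-rewrite command per file found under folder."""
    base = Path(base_dir).resolve()
    root = (base / folder).resolve()
    if not root.is_dir():
        raise NotADirectoryError(str(root))
    commands = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = path.relative_to(base).as_posix()  # e.g. cc/myfile.parquet
        commands.append([
            sys.executable, "-m", "py4lexis.cli", "datasets", "tus-uploader-rewrite",
            dataset_id, PROJECT, relative, *STORAGE_ARGS,
            "--file-path", base_dir, "--destination-path", "/",
        ])
    return commands


def upload_all(commands: list, dry_run: bool = True) -> None:
    """Print the commands, or run them one by one when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failed upload.
            subprocess.run(command, check=True)
```

Calling `upload_all(build_upload_commands("<DATASETID>", "cc"))` prints the commands without running them; passing `dry_run=False` starts the uploads.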
Step 8 – Upload Directories (Optional, Faster)
If your firewall allows iRODS connections, you can upload directories directly:
python -m py4lexis.cli lexis-irods upload-directory-to-dataset cc project openwebsearch 4322b10c-8fa6-11f0-b931-c687956b5905
Here we assume that cc is a subfolder of the current directory.
Step 9 – Final Check
After upload:
Inspect your dataset in the Lexis Portal.
Optionally ask the OpenWebSearch team to star your dataset at the OpenWebIndex portal.
Public datasets can be linked via:
https://portal.lexis.tech/publicDataSets/<DATASETID>/details
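The link can be generated from a dataset ID, for example when listing several datasets. This is a minimal sketch built on the URL pattern above; the function name `public_dataset_url` is hypothetical, and the check only confirms that the ID is UUID-shaped, not that the dataset exists or is public.

```python
import uuid


def public_dataset_url(dataset_id: str) -> str:
    """Return the public details URL for a UUID-shaped dataset ID."""
    # uuid.UUID raises ValueError if the ID is malformed.
    canonical = str(uuid.UUID(dataset_id))
    return f"https://portal.lexis.tech/publicDataSets/{canonical}/details"
```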
TODOs
Add link to proper dataset description