Tutorial 11: Building an OWI Lake#
An OpenWebIndex Lake (OWILake) is a subset of the index on a specific subject area or domain that is built by integrating daily index shards. It can be understood as a staging area for further activities, like indexing, quality estimation, or merging with external data.
The OWILake is built on top of DuckDB, utilizing its new DuckLake feature. In short, it uses a relational database, typically Postgres, for managing schema and version information about data that is stored in a data repository (mostly S3 + Parquet).
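The separation described above, a catalog database that only holds versioned metadata, while the rows themselves live in immutable Parquet files in object storage, can be pictured with a toy sketch. All class and field names below are illustrative, not the actual DuckLake schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # An immutable Parquet file in the data repository (e.g. S3)
    path: str
    row_count: int


@dataclass
class Snapshot:
    # One committed version of the lake, recorded in the catalog DB
    snapshot_id: int
    files: tuple  # the DataFiles visible at this version


class Catalog:
    """Toy stand-in for the Postgres catalog: it stores only metadata,
    never the data itself."""

    def __init__(self):
        self.snapshots = []

    def commit(self, files):
        sid = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(sid, tuple(files)))
        return sid

    def files_at(self, snapshot_id):
        # Time travel: older snapshots keep pointing at the old files
        return self.snapshots[snapshot_id - 1].files


cat = Catalog()
f1 = DataFile("s3://lake/part-000.parquet", 10_000)
v1 = cat.commit([f1])
f2 = DataFile("s3://lake/part-001.parquet", 5_000)
v2 = cat.commit([f1, f2])
assert cat.files_at(v1) == (f1,)  # the old version is still readable
assert cat.files_at(v2) == (f1, f2)
```

This is why the snapshot listings later in this tutorial always show a growing `snapshot_id`: each fill or compaction commits a new metadata version rather than mutating data files in place.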
The OpenWebIndex supports the OWILake with several mechanisms:
- You can fill an OWILake from daily index shards using owilix (shown in this tutorial)
- We can provide Postgres databases for storing/managing schema information
- We can publish OWILakes on the Dashboard to announce them to the public
- We provide mechanisms for managing, indexing, and searching OWILakes (under construction)
For the tutorial, we assume owilix V0.23.2 is installed.
Details can be found in TODO: Link to owilix help page
Building a Data Lake#
You need:
- a Postgres database
- a data storage location
Fill a local OWILake with Docker based Postgres#
Create a data lake of pages whose URL contains "bayern" or "bavaria".
docker run --rm -p 5432:5432 -e POSTGRES_PASSWORD=duck -e POSTGRES_USER=duck -e POSTGRES_DB=ducklake postgres:16
If you do not already have datasets locally, pull some first (you can also query remotely, but that is usually slower when selecting all the data):
owilix remote pull all:latest#2
Import into the data lake:
owilix owilake fill --local all:latest#2/collectionName=main catalog_name=bayern_lake "where=url ilike '%bayern%' or url ilike '%bavaria%'" "catalog_connect=pg:duck:duck@postgresql://localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data" buffer_rows=10000

Here,
- where is the selection filter for rows
- --local selects the latest two datasets available locally
- catalog_name identifies your catalog
- catalog_connect specifies the local Postgres database started via Docker
- data_connect specifies where the data is stored.
There are further parameters, such as buffer_rows, which controls the flushing behaviour.
Note:
- Access details etc. are configured entirely via the catalog and data connect strings, which can also contain environment variables.
- The fill command selects all columns, but only those rows specified in the where clause.
- The fill command also has a resume mechanism per OWI dataset + filename, so multiple fills with the same data can be avoided. However, the last fill will always overwrite records with the same id.
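The resume bookkeeping can be pictured as a manifest keyed by (dataset, filename). The sketch below is purely illustrative (the real mechanism lives in the meta_* schemas shown later in this tutorial), but it captures the two behaviours named above: skipping already-imported files, and last-write-wins per record id:

```python
class FillManifest:
    """Tracks which (dataset_id, filename) pairs were already imported,
    so a re-run can skip them. Records with the same id are overwritten."""

    def __init__(self):
        self.done = set()      # (dataset_id, filename) pairs already filled
        self.records = {}      # record id -> row (last write wins)

    def fill(self, dataset_id, filename, rows):
        key = (dataset_id, filename)
        if key in self.done:
            return 0  # resume: this file was already imported, skip it
        for row in rows:
            self.records[row["id"]] = row  # same id -> overwritten
        self.done.add(key)
        return len(rows)


m = FillManifest()
n1 = m.fill("ds-1", "part-0.parquet", [{"id": "a", "url": "http://x"}])
n2 = m.fill("ds-1", "part-0.parquet", [{"id": "a", "url": "http://x"}])
assert (n1, n2) == (1, 0)  # second fill of the same file is skipped
```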
For more details on the command use owilix owilake help fill
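The where clause in the fill command above uses DuckDB's ilike, i.e. case-insensitive pattern matching in which % matches any substring. A quick Python approximation of which URLs the filter `url ilike '%bayern%' or url ilike '%bavaria%'` would keep (purely illustrative; the actual filtering happens inside DuckDB):

```python
def matches(url: str) -> bool:
    # ilike '%bayern%' / '%bavaria%' behaves like a
    # case-insensitive substring test
    u = url.lower()
    return "bayern" in u or "bavaria" in u


urls = [
    "https://www.lfv-bayern.de/medien",      # kept (contains "bayern")
    "https://example.com/BAVARIA/tourism",   # kept (case-insensitive)
    "https://example.com/berlin",            # dropped
]
kept = [u for u in urls if matches(u)]
assert kept == urls[:2]
```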
OWI Lake Compaction#
Data lakes need to be compacted from time to time, to reduce the number of small Parquet files and to merge/delete files with tombstone entries.
Owilix supports lightweight compaction:
owilix owilake help compact
Example:
owilix owilake compact catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"
Output:
DuckLake Options:
option_name description value scope scope_entry
0 created_by Tool used to write the DuckLake DuckDB 0b83e5d2f6 GLOBAL None
1 data_path Path to data files file:/home/mgrani/tmp/owilake/ducklake_data/ GLOBAL None
2 encrypted Whether or not to encrypt Parquet files writte... false GLOBAL None
3 parquet_compression Compression algorithm for Parquet files (uncom... zstd GLOBAL None
4 target_file_size The target data file size for insertion and co... 100000000 GLOBAL None
5 version DuckLake format version 0.2 GLOBAL None
Before compaction – latest snapshot:
snapshot_id snapshot_time schema_version changes
0 1052 2025-09-11 12:44:05.519000+02:00 8 {'tables_inserted_into': ['2']}
Merging adjacent files…
Cleaning files older than 7 days…
After compaction – latest snapshot:
snapshot_id snapshot_time schema_version changes
0 1053 2025-09-11 12:51:03.139000+02:00 8 {}
New snapshot created: 1053 (was 1052).
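The core idea behind merging adjacent files is to greedily group small files until each group approaches the configured target_file_size (100000000 bytes in the options table above). A rough sketch of that binning step, with made-up file sizes; the actual compaction logic inside DuckLake may differ:

```python
def plan_compaction(file_sizes, target=100_000_000):
    """Group adjacent files so each merged output stays near `target` bytes."""
    groups, current, size = [], [], 0
    for f in file_sizes:
        if current and size + f > target:
            groups.append(current)   # close the current group
            current, size = [], 0
        current.append(f)
        size += f
    if current:
        groups.append(current)
    return groups


# Ten 30 MB files -> three groups of three (90 MB each) plus a remainder
sizes = [30_000_000] * 10
plan = plan_compaction(sizes)
assert [len(g) for g in plan] == [3, 3, 3, 1]
```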
If you need more, you have to set up your own workflows.
Querying a data lake#
owilix owilake query sql="Select * from owi.main limit 10" catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"
DuckLake Options:
option_name description value scope scope_entry
0 created_by Tool used to write the DuckLake DuckDB 0b83e5d2f6 GLOBAL None
1 data_path Path to data files file:/home/mgrani/tmp/owilake/ducklake_data/ GLOBAL None
2 encrypted Whether or not to encrypt Parquet files writte... false GLOBAL None
3 parquet_compression Compression algorithm for Parquet files (uncom... zstd GLOBAL None
4 target_file_size The target data file size for insertion and co... 100000000 GLOBAL None
5 version DuckLake format version 0.2 GLOBAL None
id record_id title ... crawling_error ows_referer ows_tags
0 e77295ba9e2b20fb63872596b85376b791daba99e11ded... 02d133d8-a06e-407f-94eb-2ffdb6553d46 Medien | LFV Bayern ... None None [HTML]
1 e66c9a8206462585340e7b8bb6c4ce24e1c92aae309529... 960ba17e-7e02-4bf4-b256-5545a00dc54c Artikel mit Logo BDS Bayern e.V. – reitshop4you ... None None [HTML]
2 e7923555c7c0d2ffb4304d3a173845082da235ce3682b2... 7b1c4400-7734-4f55-866b-4cdfc0aa43ca Selbstversorgerhaus in Ostbayern finden ... None None [HTML]
3 e73e1d7d89811bf89f779d6109aaf07b69307467a089ae... 30480219-850f-411e-b662-a6f042209359 • Ingolstadt - Geld leihen für Autokauf ohne A... ... None None [HTML]
4 e74c131c3702224a6a0919750d443d99b3d1960d379e60... fa241aec-446b-4721-921e-85d8eef2ac36 Messung und Bewertung elektromagnetischer Feld... ... None None [HTML]
5 e74e9d9a7bb987f5e6c7b1684ba769a4a32df4c7b3c9e2... f683c74e-6e30-440b-8740-33d82f9e7099 Bayerns Meisterwerke. Eine Gesprächsreihe auf ... ... None None [HTML]
6 e6743b1024f06e93f1225456cb8995270ba5a9b4540093... ad81b62a-aa3b-41c7-bb52-a1863d19929b Bayerns Gesundheitsminister und das KWS beim A... ... None None [HTML]
7 e6a8c905dd648f2a0d1e1eb07af52f9f7c51607383c835... d51d4dd0-fcd1-401b-abf8-f033045b2fe5 Informationen - Dienstleistungen - Formalitäte... ... None None [HTML]
8 e7aee33170fdfb5cd6e5aa52657045332e47a4b1fb5348... 7ea44b5c-0ea9-4030-b982-847c7dcbd8e7 BESTE Aktivitäten an Regentagen Bayern 2024 – ... ... None None [HTML]
9 e7838a0fd86ce67fe82e2d507b11802f2b2edad7eeba25... 6809a758-3c03-4a22-9222-d7c0b0379895 cropped-Logo_ERDenboden.jpg | Realwert-Bayern ... None None [HTML]
[10 rows x 52 columns]
Getting info on the OWI Lake#
owilix owilake info catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"
DuckLake Options:
option_name description value scope scope_entry
0 created_by Tool used to write the DuckLake DuckDB 0b83e5d2f6 GLOBAL None
1 data_path Path to data files file:/home/mgrani/tmp/owilake/ducklake_data/ GLOBAL None
2 encrypted Whether or not to encrypt Parquet files writte... false GLOBAL None
3 parquet_compression Compression algorithm for Parquet files (uncom... zstd GLOBAL None
4 target_file_size The target data file size for insertion and co... 100000000 GLOBAL None
5 version DuckLake format version 0.2 GLOBAL None
Recent snapshots
snapshot_id snapshot_time schema_version changes
0 1053 2025-09-11 12:51:03.139000+02:00 8 {}
1 1052 2025-09-11 12:44:05.519000+02:00 8 {'tables_inserted_into': ['2']}
2 1051 2025-09-10 17:07:05.157000+02:00 8 {'tables_inserted_into': ['2'], 'tables_delete...
3 1050 2025-09-10 17:07:01.232000+02:00 8 {'tables_inserted_into': ['2']}
4 1049 2025-09-10 17:06:59.703000+02:00 8 {'tables_inserted_into': ['2'], 'tables_delete...
5 1048 2025-09-10 17:06:58.279000+02:00 8 {'tables_inserted_into': ['2'], 'tables_delete...
6 1047 2025-09-10 17:06:57.130000+02:00 8 {'tables_inserted_into': ['2']}
7 1046 2025-09-10 17:06:55.761000+02:00 8 {'tables_inserted_into': ['2']}
8 1045 2025-09-10 17:06:55.052000+02:00 8 {'tables_inserted_into': ['2'], 'tables_delete...
9 1044 2025-09-10 17:06:51.686000+02:00 8 {'tables_inserted_into': ['2'], 'tables_delete...
Tables
schema_name table_name
0 meta_bayern_lake imported_sources
1 meta_lake imported_sources
2 owi main
3 public ducklake_column
4 public ducklake_column_mapping
5 public ducklake_column_tag
6 public ducklake_data_file
7 public ducklake_delete_file
8 public ducklake_file_column_statistics
9 public ducklake_file_partition_value
10 public ducklake_files_scheduled_for_deletion
11 public ducklake_inlined_data_tables
12 public ducklake_metadata
13 public ducklake_name_mapping
14 public ducklake_partition_column
15 public ducklake_partition_info
16 public ducklake_schema
17 public ducklake_snapshot
18 public ducklake_snapshot_changes
19 public ducklake_table
20 public ducklake_table_column_stats
21 public ducklake_table_stats
22 public ducklake_tag
23 public ducklake_view
Manifest – datasets imported
dataset_id files
0 9bb61f5a-7b76-11f0-8f7f-c2f2eded91dd 11
1 f6071936-7be0-11f0-b039-c2f2eded91dd 9
Purging a catalog#
owilix owilake purge catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"
DuckLake Options:
option_name description value scope scope_entry
0 created_by Tool used to write the DuckLake DuckDB 0b83e5d2f6 GLOBAL None
1 data_path Path to data files file:/home/mgrani/tmp/owilake/ducklake_data/ GLOBAL None
2 encrypted Whether or not to encrypt Parquet files writte... false GLOBAL None
3 parquet_compression Compression algorithm for Parquet files (uncom... zstd GLOBAL None
4 target_file_size The target data file size for insertion and co... 100000000 GLOBAL None
5 version DuckLake format version 0.2 GLOBAL None
Purging lake: bayern_lake
schema_name table_name
0 meta_bayern_lake imported_sources
1 meta_lake imported_sources
2 owi main
3 public ducklake_column
4 public ducklake_column_mapping
5 public ducklake_column_tag
6 public ducklake_data_file
7 public ducklake_delete_file
8 public ducklake_file_column_statistics
9 public ducklake_file_partition_value
10 public ducklake_files_scheduled_for_deletion
11 public ducklake_inlined_data_tables
12 public ducklake_metadata
13 public ducklake_name_mapping
14 public ducklake_partition_column
15 public ducklake_partition_info
16 public ducklake_schema
17 public ducklake_snapshot
18 public ducklake_snapshot_changes
19 public ducklake_table
20 public ducklake_table_column_stats
21 public ducklake_table_stats
22 public ducklake_tag
23 public ducklake_view
Dropped manifest schema bayern_lake_manifest_pg.meta_bayern_lake
Using a remote S3 and a remote Postgres database#
Create an .env file:
export OWILAKE_PQ_USERNAME='me'
export OWILAKE_PQ_PWD='adfasdfas'
#s3://2006391-ows-ducklake/bayern_lake?endpoint_url=${OWILAKE_S3_SERVER}&aws_access_key_id=${OWILAKE_S3_KEY}&aws_secret_access_key=${OWILAKE_S3_SECRET}&url_style=path"
export OWILAKE_S3_SERVER='https://ENDPOINT.COM'
export OWILAKE_S3_KEY='yes'
export OWILAKE_S3_SECRET='no'
owilake fill --local all:latest#2/collectionName=main catalog_name=bayern_lake "where=url like '%bayern%' or url like '%Bayern%' or url like '%BAYERN%' or url like '%bavaria%' or url like '%Bavaria%'" "catalog_connect=postgresql://${OWILAKE_PQ_USERNAME}:${OWILAKE_PQ_PWD}@193.167.189.42:5432/ducklake" "data_connect=s3://2006391-ows-ducklake/bayern_lake?endpoint_url=${OWILAKE_S3_SERVER}&aws_access_key_id=${OWILAKE_S3_KEY}&aws_secret_access_key=${OWILAKE_S3_SECRET}&url_style=path"
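Note that the ${…} placeholders in the connect strings above are expanded by the shell (after sourcing the .env file), before owilix ever sees them. If you assemble the strings in a script instead, Python's string.Template performs the same substitution; a small sketch with the dummy values from the .env file:

```python
import os
from string import Template

# Dummy values standing in for the sourced .env file
os.environ.update({
    "OWILAKE_S3_SERVER": "https://ENDPOINT.COM",
    "OWILAKE_S3_KEY": "yes",
    "OWILAKE_S3_SECRET": "no",
})

# Same ${VAR} syntax as the shell command, expanded in Python
data_connect = Template(
    "s3://2006391-ows-ducklake/bayern_lake"
    "?endpoint_url=${OWILAKE_S3_SERVER}"
    "&aws_access_key_id=${OWILAKE_S3_KEY}"
    "&aws_secret_access_key=${OWILAKE_S3_SECRET}"
    "&url_style=path"
).substitute(os.environ)

assert "endpoint_url=https://ENDPOINT.COM" in data_connect
```

Keeping credentials in environment variables this way avoids hard-coding secrets into commands or shell history.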