Tutorial 11: Building an OWI Lake#

An OpenWebIndex Lake (OWILake) is a subset of the index covering a specific subject area or domain, built by integrating daily index shards. It can be understood as a staging area for further activities such as indexing, quality estimation, or merging with external data.

The OWILake is built on top of DuckDB, utilizing its new DuckLake feature. In short, it uses a relational database, typically Postgres, to manage schema and version information for data that is stored in a data repository (usually S3 + Parquet).
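This split between catalog and data is visible directly in the two connect strings used throughout this tutorial: one points at the Postgres catalog, the other at the Parquet store. As a minimal sketch (the strings mirror the later examples; the parsing is plain standard-library URL handling, not part of owilix):

```python
from urllib.parse import urlparse, parse_qs

# Catalog: a Postgres database holding schema and snapshot metadata.
catalog = urlparse("postgresql://duck:duck@localhost:5432/ducklake")
print(catalog.hostname, catalog.port, catalog.path)  # localhost 5432 /ducklake

# Data: Parquet files on local disk or S3; options travel as query parameters.
data = urlparse("s3://my-bucket/bayern_lake?url_style=path")
print(data.netloc, data.path, parse_qs(data.query))
```

Changing where the Parquet files live therefore only changes data_connect; the catalog database is untouched.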

The OpenWebIndex supports the OWILake with several mechanisms:

  1. You can fill an OWILake from daily index shards using owilix (shown in this tutorial)

  2. We can provide Postgres databases for storing/managing schema information

  3. We can publish OWILakes on the Dashboard to announce them to the public

  4. We provide mechanisms for managing, indexing and searching OWILakes (under construction)

For the tutorial, we assume owilix V0.23.2 is installed.

Details can be found in TODO: Link to owilix help page

Building a Data Lake#

You need:

  1. a postgres database

  2. a data store (local filesystem or S3)

Fill a local OWILake with Docker-based Postgres#

We will create a data lake of rows whose URL contains bayern or bavaria. First, start the catalog database via Docker:

docker run --rm -p 5432:5432 -e POSTGRES_PASSWORD=duck -e POSTGRES_USER=duck -e POSTGRES_DB=ducklake postgres:16
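A quick way to verify the container is listening before running the fill; this is a plain stdlib sketch, not an owilix feature:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After the docker run above, the catalog database should answer on 5432.
print(port_open("localhost", 5432))
```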

If you do not already have datasets locally, pull some first (you can also query remotely, but that is usually slower when selecting all the data):

owilix remote pull all:latest#2

Import into the data lake:

owilix owilake fill --local all:latest#2/collectionName=main catalog_name=bayern_lake "where=url ilike '%bayern%' or url ilike '%bavaria%'" "catalog_connect=pg:duck:duck@postgresql://localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data" buffer_rows=10000

Here,

  • where is the row selection filter

  • --local together with all:latest#2 selects the latest 2 datasets available locally

  • catalog_name identifies your catalog

  • catalog_connect specifies the local Postgres database started via Docker

  • data_connect specifies where the data is stored

  • there are further parameters such as buffer_rows, which controls flushing behaviour

Note

  • that we configure access details etc. entirely via the catalog and data connect strings, which can also contain environment variables

  • that the fill command selects all columns, but only those rows matching the where clause

  • that the fill command also has a resume mechanism per OWI dataset + filename, so repeated fills with the same data can be avoided. However, the last fill will always overwrite records with the same id.
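The resume behaviour can be pictured as bookkeeping over (dataset, filename) pairs. The following is an illustrative sketch only, not owilix's actual implementation (names such as already_imported are made up):

```python
# Sketch: skip source files already imported, but let re-imported
# records overwrite earlier ones that share the same id.
already_imported = set()   # (dataset_id, filename) pairs, as in the manifest
lake = {}                  # id -> record

def fill(dataset_id, filename, records):
    if (dataset_id, filename) in already_imported:
        return 0                       # resume: this file was already loaded
    for rec in records:
        lake[rec["id"]] = rec          # last write wins for duplicate ids
    already_imported.add((dataset_id, filename))
    return len(records)

fill("ds1", "part-0.parquet", [{"id": "a", "title": "old"}])
fill("ds1", "part-0.parquet", [{"id": "a", "title": "ignored"}])  # skipped
fill("ds2", "part-0.parquet", [{"id": "a", "title": "new"}])      # overwrites
print(lake["a"]["title"])  # new
```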

For more details on the command, use owilix owilake help fill

OWI Lake Compaction#

Data lakes need to be compacted from time to time, to reduce the number of small Parquet files and to merge/delete files with tombstone entries.
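A rough sketch of what merging adjacent small files means: group neighbouring files into batches bounded by a target size, so each batch can be rewritten as one larger Parquet file. The heuristic below is illustrative; DuckLake's actual compaction is driven by the target_file_size option shown in the command output further down.

```python
# Plan merge batches of adjacent files, each batch up to a target size.
def plan_merges(file_sizes, target=100_000_000):
    batches, current, total = [], [], 0
    for size in file_sizes:
        if current and total + size > target:
            batches.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        batches.append(current)
    return batches

# Ten 30 MB files collapse into batches of at most ~100 MB.
print(plan_merges([30_000_000] * 10))
```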

Owilix supports lightweight compaction:

owilix owilake help compact

Example:

owilix owilake compact catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"

Output:

DuckLake Options:
           option_name                                        description                                         value   scope scope_entry
0           created_by                    Tool used to write the DuckLake                             DuckDB 0b83e5d2f6  GLOBAL        None
1            data_path                                 Path to data files  file:/home/mgrani/tmp/owilake/ducklake_data/  GLOBAL        None
2            encrypted  Whether or not to encrypt Parquet files writte...                                         false  GLOBAL        None
3  parquet_compression  Compression algorithm for Parquet files (uncom...                                          zstd  GLOBAL        None
4     target_file_size  The target data file size for insertion and co...                                     100000000  GLOBAL        None
5              version                            DuckLake format version                                           0.2  GLOBAL        None
Before compaction  latest snapshot:
   snapshot_id                    snapshot_time  schema_version                          changes
0         1052 2025-09-11 12:44:05.519000+02:00               8  {'tables_inserted_into': ['2']}
Merging adjacent files…
Cleaning files older than 7 days…
After compaction  latest snapshot:
   snapshot_id                    snapshot_time  schema_version changes
0         1053 2025-09-11 12:51:03.139000+02:00               8      {}
New snapshot created: 1053 (was 1052).

If you need more, you have to set up your own workflows.

Querying a data lake#

owilix owilake query sql="Select * from owi.main limit 10" catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"

DuckLake Options:
           option_name                                        description                                         value   scope scope_entry
0           created_by                    Tool used to write the DuckLake                             DuckDB 0b83e5d2f6  GLOBAL        None
1            data_path                                 Path to data files  file:/home/mgrani/tmp/owilake/ducklake_data/  GLOBAL        None
2            encrypted  Whether or not to encrypt Parquet files writte...                                         false  GLOBAL        None
3  parquet_compression  Compression algorithm for Parquet files (uncom...                                          zstd  GLOBAL        None
4     target_file_size  The target data file size for insertion and co...                                     100000000  GLOBAL        None
5              version                            DuckLake format version                                           0.2  GLOBAL        None
                                                  id                             record_id                                              title  ... crawling_error ows_referer ows_tags
0  e77295ba9e2b20fb63872596b85376b791daba99e11ded...  02d133d8-a06e-407f-94eb-2ffdb6553d46                                Medien | LFV Bayern  ...           None        None   [HTML]
1  e66c9a8206462585340e7b8bb6c4ce24e1c92aae309529...  960ba17e-7e02-4bf4-b256-5545a00dc54c    Artikel mit Logo BDS Bayern e.V. – reitshop4you  ...           None        None   [HTML]
2  e7923555c7c0d2ffb4304d3a173845082da235ce3682b2...  7b1c4400-7734-4f55-866b-4cdfc0aa43ca            Selbstversorgerhaus in Ostbayern finden  ...           None        None   [HTML]
3  e73e1d7d89811bf89f779d6109aaf07b69307467a089ae...  30480219-850f-411e-b662-a6f042209359  • Ingolstadt - Geld leihen für Autokauf ohne A...  ...           None        None   [HTML]
4  e74c131c3702224a6a0919750d443d99b3d1960d379e60...  fa241aec-446b-4721-921e-85d8eef2ac36  Messung und Bewertung elektromagnetischer Feld...  ...           None        None   [HTML]
5  e74e9d9a7bb987f5e6c7b1684ba769a4a32df4c7b3c9e2...  f683c74e-6e30-440b-8740-33d82f9e7099  Bayerns Meisterwerke. Eine Gesprächsreihe auf ...  ...           None        None   [HTML]
6  e6743b1024f06e93f1225456cb8995270ba5a9b4540093...  ad81b62a-aa3b-41c7-bb52-a1863d19929b  Bayerns Gesundheitsminister und das KWS beim A...  ...           None        None   [HTML]
7  e6a8c905dd648f2a0d1e1eb07af52f9f7c51607383c835...  d51d4dd0-fcd1-401b-abf8-f033045b2fe5  Informationen - Dienstleistungen - Formalitäte...  ...           None        None   [HTML]
8  e7aee33170fdfb5cd6e5aa52657045332e47a4b1fb5348...  7ea44b5c-0ea9-4030-b982-847c7dcbd8e7  BESTE Aktivitäten an Regentagen Bayern 2024 – ...  ...           None        None   [HTML]
9  e7838a0fd86ce67fe82e2d507b11802f2b2edad7eeba25...  6809a758-3c03-4a22-9222-d7c0b0379895      cropped-Logo_ERDenboden.jpg | Realwert-Bayern  ...           None        None   [HTML]

[10 rows x 52 columns]

Getting info on the OWI Lake#

owilix owilake info catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"

DuckLake Options:
           option_name                                        description                                         value   scope scope_entry
0           created_by                    Tool used to write the DuckLake                             DuckDB 0b83e5d2f6  GLOBAL        None
1            data_path                                 Path to data files  file:/home/mgrani/tmp/owilake/ducklake_data/  GLOBAL        None
2            encrypted  Whether or not to encrypt Parquet files writte...                                         false  GLOBAL        None
3  parquet_compression  Compression algorithm for Parquet files (uncom...                                          zstd  GLOBAL        None
4     target_file_size  The target data file size for insertion and co...                                     100000000  GLOBAL        None
5              version                            DuckLake format version                                           0.2  GLOBAL        None
Recent snapshots
   snapshot_id                    snapshot_time  schema_version                                            changes
0         1053 2025-09-11 12:51:03.139000+02:00               8                                                 {}
1         1052 2025-09-11 12:44:05.519000+02:00               8                    {'tables_inserted_into': ['2']}
2         1051 2025-09-10 17:07:05.157000+02:00               8  {'tables_inserted_into': ['2'], 'tables_delete...
3         1050 2025-09-10 17:07:01.232000+02:00               8                    {'tables_inserted_into': ['2']}
4         1049 2025-09-10 17:06:59.703000+02:00               8  {'tables_inserted_into': ['2'], 'tables_delete...
5         1048 2025-09-10 17:06:58.279000+02:00               8  {'tables_inserted_into': ['2'], 'tables_delete...
6         1047 2025-09-10 17:06:57.130000+02:00               8                    {'tables_inserted_into': ['2']}
7         1046 2025-09-10 17:06:55.761000+02:00               8                    {'tables_inserted_into': ['2']}
8         1045 2025-09-10 17:06:55.052000+02:00               8  {'tables_inserted_into': ['2'], 'tables_delete...
9         1044 2025-09-10 17:06:51.686000+02:00               8  {'tables_inserted_into': ['2'], 'tables_delete...
Tables
         schema_name                             table_name
0   meta_bayern_lake                       imported_sources
1          meta_lake                       imported_sources
2                owi                                   main
3             public                        ducklake_column
4             public                ducklake_column_mapping
5             public                    ducklake_column_tag
6             public                     ducklake_data_file
7             public                   ducklake_delete_file
8             public        ducklake_file_column_statistics
9             public          ducklake_file_partition_value
10            public  ducklake_files_scheduled_for_deletion
11            public           ducklake_inlined_data_tables
12            public                      ducklake_metadata
13            public                  ducklake_name_mapping
14            public              ducklake_partition_column
15            public                ducklake_partition_info
16            public                        ducklake_schema
17            public                      ducklake_snapshot
18            public              ducklake_snapshot_changes
19            public                         ducklake_table
20            public            ducklake_table_column_stats
21            public                   ducklake_table_stats
22            public                           ducklake_tag
23            public                          ducklake_view
Manifest – datasets imported
                             dataset_id  files
0  9bb61f5a-7b76-11f0-8f7f-c2f2eded91dd     11
1  f6071936-7be0-11f0-b039-c2f2eded91dd      9

Purging a catalog#

owilix owilake purge catalog_name=bayern_lake "catalog_connect=postgresql://duck:duck@localhost:5432/ducklake" "data_connect=file:/home/mgrani/tmp/owilake/ducklake_data"

DuckLake Options:
           option_name                                        description                                         value   scope scope_entry
0           created_by                    Tool used to write the DuckLake                             DuckDB 0b83e5d2f6  GLOBAL        None
1            data_path                                 Path to data files  file:/home/mgrani/tmp/owilake/ducklake_data/  GLOBAL        None
2            encrypted  Whether or not to encrypt Parquet files writte...                                         false  GLOBAL        None
3  parquet_compression  Compression algorithm for Parquet files (uncom...                                          zstd  GLOBAL        None
4     target_file_size  The target data file size for insertion and co...                                     100000000  GLOBAL        None
5              version                            DuckLake format version                                           0.2  GLOBAL        None
Purging lake:  bayern_lake
         schema_name                             table_name
0   meta_bayern_lake                       imported_sources
1          meta_lake                       imported_sources
2                owi                                   main
3             public                        ducklake_column
4             public                ducklake_column_mapping
5             public                    ducklake_column_tag
6             public                     ducklake_data_file
7             public                   ducklake_delete_file
8             public        ducklake_file_column_statistics
9             public          ducklake_file_partition_value
10            public  ducklake_files_scheduled_for_deletion
11            public           ducklake_inlined_data_tables
12            public                      ducklake_metadata
13            public                  ducklake_name_mapping
14            public              ducklake_partition_column
15            public                ducklake_partition_info
16            public                        ducklake_schema
17            public                      ducklake_snapshot
18            public              ducklake_snapshot_changes
19            public                         ducklake_table
20            public            ducklake_table_column_stats
21            public                   ducklake_table_stats
22            public                           ducklake_tag
23            public                          ducklake_view
Dropped manifest schema bayern_lake_manifest_pg.meta_bayern_lake

Using a remote S3 and a remote Postgres database#

Create an .env file and source it before running the command:

export OWILAKE_PQ_USERNAME='me'
export OWILAKE_PQ_PWD='adfasdfas'
export OWILAKE_S3_SERVER='https://ENDPOINT.COM'
export OWILAKE_S3_KEY='yes'
export OWILAKE_S3_SECRET='no'
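With the file sourced (e.g. source .env), the shell (or, per the note earlier, owilix itself) expands the ${VAR} references in the connect strings. The same substitution can be sketched with the standard-library string.Template; the values are the dummy ones from the file above:

```python
import os
from string import Template

# Dummy credentials, matching the .env file above (illustrative only).
os.environ["OWILAKE_PQ_USERNAME"] = "me"
os.environ["OWILAKE_PQ_PWD"] = "adfasdfas"

# Expand ${VAR} references in a connect string, as a shell would.
connect = Template(
    "postgresql://${OWILAKE_PQ_USERNAME}:${OWILAKE_PQ_PWD}@host:5432/ducklake"
).substitute(os.environ)
print(connect)  # postgresql://me:adfasdfas@host:5432/ducklake
```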

owilix owilake fill --local all:latest#2/collectionName=main catalog_name=bayern_lake "where=url like '%bayern%' or url like '%Bayern%' or url like '%BAYERN%' or url like '%bavaria%' or url like '%Bavaria%'" "catalog_connect=postgresql://${OWILAKE_PQ_USERNAME}:${OWILAKE_PQ_PWD}@193.167.189.42:5432/ducklake" "data_connect=s3://2006391-ows-ducklake/bayern_lake?endpoint_url=${OWILAKE_S3_SERVER}&aws_access_key_id=${OWILAKE_S3_KEY}&aws_secret_access_key=${OWILAKE_S3_SECRET}&url_style=path"
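The chain of case variants in the where clause above can also be generated programmatically, or replaced with ilike as in the earlier local example (ilike is the case-insensitive like supported by Postgres and DuckDB). A small illustrative helper:

```python
# Build a case-insensitive OR filter over URL substrings.
def url_filter(terms, case_insensitive=True):
    op = "ilike" if case_insensitive else "like"
    return " or ".join(f"url {op} '%{term}%'" for term in terms)

print(url_filter(["bayern", "bavaria"]))
# url ilike '%bayern%' or url ilike '%bavaria%'
```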