Tutorial 4: How to Analyse OWI Data - Using Jupyter Notebooks#
In this tutorial we pull data via owilix and analyse it using standard Python-based tools.
Pulling a small collection#
As a first step, we pull the legal collection using owilix. Make sure owilix is installed in the environment.
# execute a shell command from within the notebook
!owilix remote pull all/collectionName=legal
Listing available Parquet files#
By default, owilix puts the data under ~/.owi/public/<collectionName>, where the collection name is legal in our case.
The directory has a sub-directory per dataset as well as a .json file per dataset containing metadata about that dataset.
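To get a quick overview of what was pulled, we can list that directory; a minimal sketch, assuming the default layout described above:

import os

# Minimal sketch, assuming the default layout described above:
# one sub-directory plus one .json metadata file per dataset
dir_path = os.path.expanduser('~/.owi/public/legal')
for entry in sorted(os.listdir(dir_path)):
    full_path = os.path.join(dir_path, entry)
    kind = 'dataset directory' if os.path.isdir(full_path) else 'metadata file'
    print(f'{kind}: {entry}')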
Reading and Printing Dataset Metadata#
As a first step, we simply read the metadata and print it as a pandas DataFrame.
import os
import json
import pandas as pd

# Path to your directory with JSON files
dir_path = os.path.expanduser('~/.owi/public/legal')

# List to store JSON data
data = []

# Iterate over all files in the directory
for filename in os.listdir(dir_path):
    # Check if the file is a JSON file
    if filename.endswith(".json"):
        # Open and read the JSON file
        with open(os.path.join(dir_path, filename), 'r') as file:
            # Load the JSON data into a Python object
            json_data = json.load(file)
            # Create a new dictionary to hold the scalar values
            scalar_data = {}
            for key, value in json_data.items():
                if not isinstance(value, (list, dict)):
                    scalar_data[key] = value
            # Append the scalar data to the list
            data.append(scalar_data)

# Create a pandas DataFrame from the JSON data
df = pd.DataFrame(data)

# Display the DataFrame
print(df[["title", "collectionName", "startDate"]])
title collectionName \
0 OWI-Open Web Index-legal.owip@lrz-2024-3-10:20... legal
1 OWI-Open Web Index-legal.owip@it4i-2024-02-04:... legal
2 OWI-Open Web Index-legal.owip@lrz-2024-2-1:202... legal
3 OWI-Open Web Index-legal.owip@it4i-2024-01-01:... legal
4 OWI-Open Web Index-legal.owip@lrz-2023-10-24:2... legal
5 OWI-Open Web Index-legal.owip@it4i-2023-12-3:2... legal
6 OWI-Open Web Index-legal.owip@lrz-2024-1-1:202... legal
7 OWI-Open Web Index-legal.owip@it4i-2024-03-12:... legal
startDate
0 2024-3-10
1 2024-02-04
2 2024-2-1
3 2024-01-01
4 2023-10-24
5 2023-12-3
6 2024-1-1
7 2024-03-12
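Note that the startDate values are not consistently zero-padded (e.g. 2024-3-10 vs. 2024-02-04). If you want to sort the datasets chronologically, a small sketch that parses the strings with pandas, operating on the df from above:

# Parse the loosely formatted startDate strings (e.g. '2024-3-10' vs '2024-02-04')
# into proper timestamps so the datasets can be sorted chronologically
df["startDate"] = pd.to_datetime(df["startDate"])
print(df.sort_values("startDate")[["title", "startDate"]])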
Reading Parquet files and estimating statistics#
The main content is stored in Parquet files inside the sub-directories of each dataset. To access those files, we first write a function that collects all Parquet files matching a certain path pattern. In our case we look for files in language=eng sub-folders, as these contain pages identified as English.
import os

def collect_parquet_files(directory, pattern):
    """
    Collects all parquet files with a certain path pattern.

    Args:
        directory (str): The directory to start searching from.
        pattern (str): The pattern to match in the file path. For example, 'language=eng/*.parquet'.

    Returns:
        list: A list of file paths matching the pattern.
    """
    parquet_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if pattern in root and file.endswith('.parquet'):
                parquet_files.append(os.path.join(root, file))
    return parquet_files

# Example usage:
directory = dir_path
pattern = 'language=eng'
parquet_files = collect_parquet_files(directory, pattern)
print(f"found {len(parquet_files)} files")
found 9631 files
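Before reading any content, it can be useful to estimate how much data this is; a small sketch that sums the on-disk size of the collected files:

import os

# Sum up the on-disk size of the collected parquet files (in MiB)
total_bytes = sum(os.path.getsize(f) for f in parquet_files)
print(f"total size: {total_bytes / 1024**2:.1f} MiB")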
Accessing the data as a pandas DataFrame#
We now load the first file as a DataFrame and print its contents:
pd.read_parquet(parquet_files[0])
|  | url | domain_label | title | plain_text |
| --- | --- | --- | --- | --- |
| 0 | https://mezger.eu/impressum_en.html | None | Impressum - mezger.eu_EN | • de\n • fr\n\nmezger.eu_EN\n\nSkip navigat... |
| 1 | https://www.finopulse.com/tenet-compliance-ser... | None | Tenet Compliance Services teams up with Legal ... | Finopulse\n • Manage Money\n • Banking\n ... |
| 2 | https://www.mixmax.com/legal/security | None | Email Productivity, Sales Engagement & Automat... | Get our blog content delivered to your inbox\n... |
| 3 | https://anthemparalegal.me/tag/law/ | None | law – Anthem Paralegal | Skip to content\n\nAnthem Paralegal\n\nLegal d... |
| 4 | https://sellecklegal.com/is-you-spouse-hiding-... | None | Is you spouse hiding assets in a divorce? That... | Skip to content\nSearch\nClose\n • 734-729-44... |
| ... | ... | ... | ... | ... |
| 291 | https://www.mudmotorkit.com/forum/ucp.php?mode... | None | Swamp Tribe - User Control Panel - Login | Skip to content\n\n • Quick links\n • FAQ\n ... |
| 292 | https://www.visitgrey.ca/contact-us | Regional | Contact Us \| Grey County's Official Tourism We... | Home\n\nMain navigation\n\n • Home\n • Play\... |
| 293 | https://www.passionfruittherapy.com/contact/ | None | Contact & Fees | Passionfruit Therapy\n • Welcome\n • Service... |
| 294 | https://atlanticcityelectric.agentisenergy.com... | None | Contact Us | • Atlantic City Electric\n\nContact Us\n\nIf... |
| 295 | https://www.tmvjapanese.com/contact.html | None | TMV Japaneese | Tilak Maharashtra Vidyapeeth\n\nDeclared as de... |

296 rows × 4 columns
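Instead of reading files one by one, a handful of Parquet files can also be concatenated into a single DataFrame; a minimal sketch using the first three files (for the full collection, a lazy reader such as pyarrow.dataset would be a better fit):

# Read the first three parquet files and stack them into a single DataFrame
df_sample = pd.concat(
    (pd.read_parquet(f) for f in parquet_files[0:3]),
    ignore_index=True,
)
print(f"{len(df_sample)} rows loaded from 3 files")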
Now we iterate over all files, load each one as a pandas DataFrame, and count the number of .de URLs in the url field, i.e. we match the regular expression \.de/ against the URLs.
Note that this can take some time, so we limit ourselves to 10 Parquet files.
import pandas as pd

def count_de_urls(parquet_files):
    """
    Counts the number of .de URLs in the 'url' field of each parquet file.

    Args:
        parquet_files (list): A list of parquet file paths.

    Returns:
        dict: A dictionary with the file path as the key and the count of .de URLs as the value.
    """
    de_url_counts = {}
    for file in parquet_files:
        try:
            # Load the parquet file as a pandas DataFrame
            df = pd.read_parquet(file)
            # Count URLs containing '.de/' in the 'url' field; the dot is escaped
            # so it matches a literal '.' rather than any character
            de_url_count = df['url'].str.contains(r'\.de/', regex=True).sum()
            # Store the count in the dictionary
            de_url_counts[file] = de_url_count
        except Exception as e:
            print(f"Error reading file {file}: {e}")
    return de_url_counts

# Example usage:
parquet_files = collect_parquet_files(directory, pattern)
de_url_counts = count_de_urls(parquet_files[0:10])
print(de_url_counts)
{'/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=30/language=eng/metadata_0.parquet': 3, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/language=eng/metadata_0.parquet': 14, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=10/day=30/language=eng/metadata_0.parquet': 3, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=10/day=24/language=eng/metadata_0.parquet': 14, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=29/language=eng/metadata_0.parquet': 2, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=08/language=eng/metadata_0.parquet': 7, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=30/language=eng/metadata_0.parquet': 3, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=06/language=eng/metadata_0.parquet': 4, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=07/language=eng/metadata_0.parquet': 7, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=05/language=eng/metadata_0.parquet': 2}
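Summing the dictionary values gives the total count across the sampled files:

# Total number of .de URLs across the sampled files
print(f"total .de URLs: {sum(de_url_counts.values())}")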
We can also plot the counts.
import pandas as pd
import matplotlib.pyplot as plt

def plot_de_url_counts(de_url_counts):
    """
    Converts the de_url_counts dictionary into a pandas DataFrame and plots it.

    Args:
        de_url_counts (dict): A dictionary with the file path as the key and the count of .de URLs as the value.
    """
    # Convert the dictionary into a pandas DataFrame
    df = pd.DataFrame(list(de_url_counts.items()), columns=['File', 'DE URL Count'])
    # Shorten the file paths to their last 20 characters for readable axis labels
    df['File'] = df['File'].apply(lambda x: x[-20:])

    # Plot the DataFrame
    plt.figure(figsize=(10, 6))
    plt.bar(df['File'], df['DE URL Count'])
    plt.xlabel('File')
    plt.ylabel('DE URL Count')
    plt.title('DE URL Count per File')
    plt.xticks(rotation=90)  # Rotate the x-axis labels for better readability
    plt.tight_layout()  # Ensure the labels fit within the figure
    plt.show()

# Example usage:
parquet_files = collect_parquet_files(directory, pattern)
de_url_counts = count_de_urls(parquet_files[0:100])
plot_de_url_counts(de_url_counts)
(Plot: DE URL Count per File)
Filtering for content#
While it is not the most efficient approach, we can filter content using a regular expression. For example, we can scan for all pages that mention Germany in the URL, title, or plain text, create a separate DataFrame from them, and plot the result.
import pandas as pd

def get_filtered_pages(parquet_files, filter_regex=r'.*Germany.*'):
    """
    Returns a DataFrame with the filtered pages non-aggregated.

    Args:
        parquet_files (list): A list of parquet file paths.
        filter_regex (str): The regular expression to match against the 'url', 'title', and 'plain_text' fields.

    Returns:
        pandas.DataFrame: A DataFrame with the filtered pages non-aggregated.
    """
    filtered_pages = []
    for file in parquet_files:
        try:
            # Load the parquet file as a pandas DataFrame
            df = pd.read_parquet(file)
            # Keep pages matching the filter in the 'url', 'title', or 'plain_text' field;
            # na=False treats missing values as non-matches
            filtered_df = df[(df['url'].str.contains(filter_regex, regex=True, na=False)) |
                             (df['title'].str.contains(filter_regex, regex=True, na=False)) |
                             (df['plain_text'].str.contains(filter_regex, regex=True, na=False))]
            # Add the filtered pages to the list
            filtered_pages.append(filtered_df)
        except Exception as e:
            print(f"Error reading file {file}: {e}")
    # Concatenate the filtered pages into a single DataFrame
    filtered_pages_df = pd.concat(filtered_pages, ignore_index=True)
    return filtered_pages_df

# Example usage:
parquet_files = collect_parquet_files(directory, pattern)
filtered_pages_df = get_filtered_pages(parquet_files[0:400])
print(filtered_pages_df)
url domain_label \
0 https://mezger.eu/impressum_en.html None
1 https://www.corporate-office-contacts.com/kiewit/ None
2 https://chronic.news/big-pharma-loses-billions... None
3 https://germanyworks.com/helpdesk/contact/ None
4 https://www.flyuia.com/sa/en/contacts/corporate Regional
... ... ...
8421 https://beabongiasca.com/pages/shipping-repair... None
8422 https://caiacosmetics.co.uk/uk/info/terms-and-... None
8423 https://www.valmet.com/about-us/contact-us/aut... Business
8424 https://www.glocomms.de/contact-us/frankfurt None
8425 https://actlegal.com/hu/locations/hungary None
title \
0 Impressum - mezger.eu_EN
1 Kiewit Corporate Office [Contact: Phone Number...
2 Big Pharma Loses Billions with Each State That...
3 Contact - GERMANY WORKS.
4 Corporate clients – UIA (Saudi Arabia)
... ...
8421 \n \n Terms & Conditions\n \n \n ...
8422 Terms and Conditions
8423 Automation contact form
8424 Frankfurt
8425 Hungary - act legal
plain_text
0 • de\n • fr\n\nmezger.eu_EN\n\nSkip navigat...
1 Skip to content\nCorporate Office Contacts\n\n...
2 Sunday, May 28 2023\nLatest\n • Health Canada...
3 • Industries\n • Decarbonisation\n • Downl...
4 AGREE\n\nWE VALUE YOUR PRIVACY\n\nOur website ...
... ...
8421 Skip to content\nThis site has limited support...
8422 Current country\nUK UK GBP\n • Sweden Sweden\...
8423 Valmet - Forward\nLogin icon-login\nicon-close...
8424 Glocomms DE\n • Search for jobs\n • Germ...
8425 Bán & Partners\n\n • Főoldal\n • Hírek\n • ...
[8426 rows x 4 columns]
Finally, we count the domain_label values of the filtered pages and plot them as a histogram:

# Count the 'domain_label' occurrences (including missing values)
domain_label_counts = filtered_pages_df['domain_label'].value_counts(dropna=False)

# Plot the histogram
plt.figure(figsize=(8, 6))
domain_label_counts.plot(kind='bar')
plt.xlabel('Domain Label')
plt.ylabel('Count')
plt.title('Histogram of Domain Label Counts')
plt.xticks(rotation=45)
plt.show()
(Plot: Histogram of Domain Label Counts)
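If you want to reuse the filtered subset in a later session, it can be written back to a Parquet file; a minimal sketch (the output file name is just an example):

# Persist the filtered pages for later analysis (output path is illustrative)
filtered_pages_df.to_parquet("germany_pages.parquet", index=False)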