Tutorial 4: How to Analyse OWI Data - Using Jupyter Notebooks#

In this tutorial we aim to pull data via owilix and analyse it using standard Python-based tools.

Pulling a small collection#

As a first step, we pull the legal collection using owilix. Make sure owilix is installed in the environment.

# execute a shell command from the notebook
!owilix remote pull all/collectionName=legal
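To verify from within the notebook that owilix is available, we can look it up on the PATH with the standard library (a minimal sketch):

import shutil

# Prints the full path of the owilix executable, or None if it is not on the PATH.
print(shutil.which('owilix'))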

Listing available parquet files#

By default, owilix puts the data under ~/.owi/public/<collectionName>, where the collection name is legal in our case.

The directory has a sub-directory per dataset, as well as a .json file per dataset containing metadata about the dataset.
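To get a first overview of this layout, we can simply list the collection directory (a minimal sketch, assuming the pull above placed the legal collection in the default location):

import os

# Each dataset has a sub-directory plus a .json metadata file next to it.
base = os.path.expanduser('~/.owi/public/legal')
for entry in sorted(os.listdir(base)):
    kind = 'dataset' if os.path.isdir(os.path.join(base, entry)) else 'metadata'
    print(f'{kind:9s} {entry}')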

Reading and Printing Dataset Metadata#

As a first step, we simply read the metadata and print it as a pandas DataFrame.

import os
import json
import pandas as pd

# Path to your directory with JSON files
dir_path = os.path.expanduser('~/.owi/public/legal')

# List to store JSON data
data = []
# Iterate over all files in the directory
for filename in os.listdir(dir_path):
    # Check if the file is a JSON file
    if filename.endswith(".json"):
        # Open and read the JSON file
        with open(os.path.join(dir_path, filename), 'r') as file:
            # Load the JSON data into a Python object
            json_data = json.load(file)
            
            # Create a new dictionary to hold the scalar values
            scalar_data = {}
            for key, value in json_data.items():
                if not isinstance(value, (list, dict)):
                    scalar_data[key] = value
            
            # Append the scalar data to the list
            data.append(scalar_data)

# Create a pandas DataFrame from the JSON data
df = pd.DataFrame(data)

# Display the DataFrame
print(df[["title","collectionName", "startDate"]])
                                               title collectionName  \
0  OWI-Open Web Index-legal.owip@lrz-2024-3-10:20...          legal   
1  OWI-Open Web Index-legal.owip@it4i-2024-02-04:...          legal   
2  OWI-Open Web Index-legal.owip@lrz-2024-2-1:202...          legal   
3  OWI-Open Web Index-legal.owip@it4i-2024-01-01:...          legal   
4  OWI-Open Web Index-legal.owip@lrz-2023-10-24:2...          legal   
5  OWI-Open Web Index-legal.owip@it4i-2023-12-3:2...          legal   
6  OWI-Open Web Index-legal.owip@lrz-2024-1-1:202...          legal   
7  OWI-Open Web Index-legal.owip@it4i-2024-03-12:...          legal   

    startDate  
0   2024-3-10  
1  2024-02-04  
2    2024-2-1  
3  2024-01-01  
4  2023-10-24  
5   2023-12-3  
6    2024-1-1  
7  2024-03-12  
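As an alternative to collecting scalar values by hand, pandas can also flatten the nested metadata directly: json_normalize turns nested dicts into dotted column names (a sketch reusing dir_path from above):

import os
import json
import pandas as pd

# Load the raw metadata records and let json_normalize flatten nested dicts.
records = []
for filename in os.listdir(dir_path):
    if filename.endswith(".json"):
        with open(os.path.join(dir_path, filename), 'r') as file:
            records.append(json.load(file))

df_all = pd.json_normalize(records)
print(df_all.columns.tolist())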

Reading Parquet files and estimating statistics#

The main content is contained in parquet files inside the sub-directories of the datasets. To access those files, we first write a function that collects all parquet files matching a certain path pattern. In our case we aim for files in language=eng sub-folders, as these contain pages identified as English.

import os

def collect_parquet_files(directory, pattern):
    """
    Collects all parquet files with a certain path pattern.

    Args:
    directory (str): The directory to start searching from.
    pattern (str): A substring that must occur in the file's directory path, for example 'language=eng'.

    Returns:
    list: A list of file paths matching the pattern.
    """
    parquet_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if pattern in root and file.endswith('.parquet'):
                parquet_files.append(os.path.join(root, file))
    return parquet_files

# Example usage:
directory = dir_path
pattern = 'language=eng'

parquet_files = collect_parquet_files(directory, pattern)
print(f"found {len(parquet_files)} files") 
found 9631 files
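The same traversal can be written more compactly with the standard library's glob module (a sketch, assuming the year=/month=/day=/language= layout produced by owilix):

import glob
import os

# '**' matches the intermediate year=/month=/day= partition directories.
parquet_files_glob = glob.glob(
    os.path.join(directory, '**', 'language=eng', '*.parquet'),
    recursive=True,
)
print(f"found {len(parquet_files_glob)} files")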

Accessing the data as a pandas DataFrame#

We now load the first file as a DataFrame and print its content:

pd.read_parquet(parquet_files[0])
url domain_label title plain_text
0 https://mezger.eu/impressum_en.html None Impressum - mezger.eu_EN • de\n • fr\n\nmezger.eu_EN\n\nSkip navigat...
1 https://www.finopulse.com/tenet-compliance-ser... None Tenet Compliance Services teams up with Legal ... Finopulse\n • Manage Money\n • Banking\n ...
2 https://www.mixmax.com/legal/security None Email Productivity, Sales Engagement & Automat... Get our blog content delivered to your inbox\n...
3 https://anthemparalegal.me/tag/law/ None law – Anthem Paralegal Skip to content\n\nAnthem Paralegal\n\nLegal d...
4 https://sellecklegal.com/is-you-spouse-hiding-... None Is you spouse hiding assets in a divorce? That... Skip to content\nSearch\nClose\n • 734-729-44...
... ... ... ... ...
291 https://www.mudmotorkit.com/forum/ucp.php?mode... None Swamp Tribe - User Control Panel - Login Skip to content\n\n • Quick links\n • FAQ\n ...
292 https://www.visitgrey.ca/contact-us Regional Contact Us | Grey County's Official Tourism We... Home\n\nMain navigation\n\n • Home\n • Play\...
293 https://www.passionfruittherapy.com/contact/ None Contact & Fees Passionfruit Therapy\n • Welcome\n • Service...
294 https://atlanticcityelectric.agentisenergy.com... None Contact Us • Atlantic City Electric\n\nContact Us\n\nIf...
295 https://www.tmvjapanese.com/contact.html None TMV Japaneese Tilak Maharashtra Vidyapeeth\n\nDeclared as de...

296 rows × 4 columns
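Parquet is a columnar format, so we can also restrict the read to the columns we actually need, which avoids materialising the large plain_text field (a sketch):

import pandas as pd

# Read only the 'url' and 'title' columns; the remaining columns are skipped entirely.
urls = pd.read_parquet(parquet_files[0], columns=['url', 'title'])
print(urls.head())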

Now we iterate over all files, load each one as a pandas DataFrame, and count the number of .de URLs in the url field, i.e. we search for the regular expression \.de/.

Note that this can take some time, so we limit ourselves to 10 parquet files.

import pandas as pd
import re

def count_de_urls(parquet_files):
    """
    Counts the number of .de URLs in the 'url' field of each parquet file.

    Args:
    parquet_files (list): A list of parquet file paths.

    Returns:
    dict: A dictionary with the file path as the key and the count of .de URLs as the value.
    """
    de_url_counts = {}
    for file in parquet_files:
        try:
            # Load the parquet file as a pandas DataFrame
            df = pd.read_parquet(file)

            # Count URLs containing '.de/' (escape the dot so it matches literally)
            de_url_count = df['url'].str.contains(r'\.de/', regex=True, na=False).sum()

            # Store the count in the dictionary
            de_url_counts[file] = de_url_count
        except Exception as e:
            print(f"Error reading file {file}: {e}")
    
    return de_url_counts

# Example usage:
parquet_files = collect_parquet_files(directory, pattern)
de_url_counts = count_de_urls(parquet_files[0:10])

print(de_url_counts)
{'/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=30/language=eng/metadata_0.parquet': 3, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/language=eng/metadata_0.parquet': 14, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=10/day=30/language=eng/metadata_0.parquet': 3, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=10/day=24/language=eng/metadata_0.parquet': 14, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=29/language=eng/metadata_0.parquet': 2, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=08/language=eng/metadata_0.parquet': 7, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=30/language=eng/metadata_0.parquet': 3, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=06/language=eng/metadata_0.parquet': 4, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=07/language=eng/metadata_0.parquet': 7, '/home/mgrani/.owi/public/legal/0bb8c994-394a-11ef-84f3-0242ac1d0009/year=2023/month=10/day=24/year=2023/month=12/day=05/language=eng/metadata_0.parquet': 2}
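The per-file counts can be aggregated into a single total by summing the dictionary values:

# Total number of .de URLs across the sampled files.
total = sum(de_url_counts.values())
print(f"{total} .de URLs in {len(de_url_counts)} files")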

We can also plot the counts.

import pandas as pd
import matplotlib.pyplot as plt

def plot_de_url_counts(de_url_counts):
    """
    Converts the de_url_counts dictionary into a pandas DataFrame and plots it.

    Args:
    de_url_counts (dict): A dictionary with the file path as the key and the count of .de URLs as the value.
    """
    # Convert the dictionary into a pandas DataFrame
    df = pd.DataFrame(list(de_url_counts.items()), columns=['File', 'DE URL Count'])
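    # Shorten each file path to its last 20 characters so the x-axis labels stay readable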
    df['File'] = df['File'].apply(lambda x: x[-20:])
    # Plot the DataFrame
    plt.figure(figsize=(10, 6))
    plt.bar(df['File'], df['DE URL Count'])
    plt.xlabel('File')
    plt.ylabel('DE URL Count')
    plt.title('DE URL Count per File')
    plt.xticks(rotation=90)  # Rotate the x-axis labels for better readability
    plt.tight_layout()  # Ensure the labels fit within the figure
    plt.show()

# Example usage:
parquet_files = collect_parquet_files(directory, pattern)
de_url_counts = count_de_urls(parquet_files[0:100])
plot_de_url_counts(de_url_counts)
[Figure: bar chart 'DE URL Count per File']
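With many files the x-axis labels become crowded; sorting the counts and switching to a horizontal bar chart often reads better (a variant sketch reusing the same data):

import pandas as pd
import matplotlib.pyplot as plt

# Variant: sorted horizontal bars keep long file labels legible.
df_counts = pd.DataFrame(list(de_url_counts.items()), columns=['File', 'DE URL Count'])
df_counts['File'] = df_counts['File'].str[-20:]
df_counts = df_counts.sort_values('DE URL Count')
df_counts.plot.barh(x='File', y='DE URL Count', figsize=(8, 10), legend=False)
plt.xlabel('DE URL Count')
plt.tight_layout()
plt.show()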

Filtering for content#

While it is not the most efficient approach, we can also filter for content using a regular expression. For example, we can scan for all pages (such as Impressum pages) that contain Germany in the URL, title, or text, collect them in a separate DataFrame, and plot the result.

import pandas as pd
import re

def get_filtered_pages(parquet_files, filter_regex=r'.*Germany.*'):
    """
    Returns a single DataFrame with all pages matching the given filter.

    Args:
    parquet_files (list): A list of parquet file paths.
    filter_regex (str): The regular expression to match against the 'url', 'title', and 'plain_text' fields.

    Returns:
    pandas.DataFrame: A DataFrame with all matching pages concatenated.
    """
    filtered_pages = []
    for file in parquet_files:
        try:
            # Load the parquet file as a pandas DataFrame
            df = pd.read_parquet(file)

            # Filter pages whose 'url', 'title', or 'plain_text' field matches the pattern
            filtered_df = df[(df['url'].str.contains(filter_regex, regex=True, na=False)) |
                             (df['title'].str.contains(filter_regex, regex=True, na=False)) |
                             (df['plain_text'].str.contains(filter_regex, regex=True, na=False))]

            # Add the filtered pages to the list
            filtered_pages.append(filtered_df)
        except Exception as e:
            print(f"Error reading file {file}: {e}")
    
    # Concatenate the filtered pages
    filtered_pages_df = pd.concat(filtered_pages, ignore_index=True)
    
    return filtered_pages_df

# Example usage:
parquet_files = collect_parquet_files(directory, pattern)
filtered_pages_df = get_filtered_pages(parquet_files[0:400])
print(filtered_pages_df)
                                                    url domain_label  \
0                   https://mezger.eu/impressum_en.html         None   
1     https://www.corporate-office-contacts.com/kiewit/         None   
2     https://chronic.news/big-pharma-loses-billions...         None   
3            https://germanyworks.com/helpdesk/contact/         None   
4       https://www.flyuia.com/sa/en/contacts/corporate     Regional   
...                                                 ...          ...   
8421  https://beabongiasca.com/pages/shipping-repair...         None   
8422  https://caiacosmetics.co.uk/uk/info/terms-and-...         None   
8423  https://www.valmet.com/about-us/contact-us/aut...     Business   
8424       https://www.glocomms.de/contact-us/frankfurt         None   
8425          https://actlegal.com/hu/locations/hungary         None   

                                                  title  \
0                              Impressum - mezger.eu_EN   
1     Kiewit Corporate Office [Contact: Phone Number...   
2     Big Pharma Loses Billions with Each State That...   
3                              Contact - GERMANY WORKS.   
4                Corporate clients – UIA (Saudi Arabia)   
...                                                 ...   
8421  \n    \n    Terms & Conditions\n    \n    \n  ...   
8422                               Terms and Conditions   
8423                            Automation contact form   
8424                                          Frankfurt   
8425                                Hungary - act legal   

                                             plain_text  
0       • de\n  • fr\n\nmezger.eu_EN\n\nSkip navigat...  
1     Skip to content\nCorporate Office Contacts\n\n...  
2     Sunday, May 28 2023\nLatest\n  • Health Canada...  
3       • Industries\n  • Decarbonisation\n  • Downl...  
4     AGREE\n\nWE VALUE YOUR PRIVACY\n\nOur website ...  
...                                                 ...  
8421  Skip to content\nThis site has limited support...  
8422  Current country\nUK UK GBP\n  • Sweden Sweden\...  
8423  Valmet - Forward\nLogin icon-login\nicon-close...  
8424  Glocomms DE\n  • Search for jobs\n      • Germ...  
8425  Bán & Partners\n\n  • Főoldal\n  • Hírek\n  • ...  

[8426 rows x 4 columns]
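Since the filter is just a regular expression, variants are easy to express; for example, an inline (?i) flag makes the match case-insensitive (a sketch over a small sample):

# Case-insensitive variant of the Germany filter, using an inline regex flag.
filtered_ci = get_filtered_pages(parquet_files[0:10], filter_regex=r'(?i)germany')
print(len(filtered_ci))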
# filtered_pages_df is already a DataFrame, so count the 'domain_label' occurrences directly
domain_label_counts = filtered_pages_df['domain_label'].value_counts(dropna=False)

# Plot the histogram
plt.figure(figsize=(8, 6))
domain_label_counts.plot(kind='bar')
plt.xlabel('Domain Label')
plt.ylabel('Count')
plt.title('Histogram of Domain Label Counts')
plt.xticks(rotation=45)
plt.show()
[Figure: bar chart 'Histogram of Domain Label Counts']
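Finally, the filtered pages can be persisted so the (slow) scan over all files does not have to be repeated (a sketch; the output filename is arbitrary):

# Save the filtered result as a single parquet file for later reuse.
filtered_pages_df.to_parquet('germany_pages.parquet', index=False)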