Tutorial 14: Finetuning an LLM with OWI data using the LUMI supercomputer#

This tutorial demonstrates one way to use OWI data to finetune a large language model (LLM) on the LUMI supercomputer. In this example, Finnish-language data is downloaded and used to finetune Meta’s Llama-3.2-1B, improving its performance on Finnish-language tasks.

This tutorial has these steps:

  1. Get started on using LUMI.

  2. Create a Singularity container for data downloading with the owilix command line tool and download the data using this container.

  3. Preprocess the data and prepare it for training using Jupyter notebooks within the Jupyter environment provided by the LUMI web interface.

  4. Create a second Singularity container optimized for training with the necessary Python packages for machine learning.

  5. Create a batch job and write Python scripts for training and inference using the training container.

OWI License

Before using any data, you must review the terms under the license.
A model trained with OWI data may only be used for research purposes.

Disclaimer

Please note that this is a technical guide only and does not constitute a legal assessment of whether or how you may use the data.

This work uses index files as part of the index partition created by the OpenWebSearch.eu project that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070014 (OpenWebSearch.EU).

1. Get started on using LUMI#

To begin using the LUMI supercomputer, follow these steps (requires a valid project):

  1. Get a user account

  2. Set up an SSH key pair to be able to use LUMI from a terminal

  3. Log in to LUMI with SSH client

ssh -i <path-to-private-key> <username>@lumi.csc.fi

1.1. Where to store data - disk areas#

Each user has a home directory ($HOME) that can contain up to 20 GB of data. Do not use it for data and code - use /project or /scratch instead. See more about the different disk areas here: https://docs.lumi-supercomputer.eu/storage/#where-to-store-data

1.2. Installing Python packages#

Installing packages directly via pip or conda is not recommended, as it puts a lot of strain on the Lustre file system on LUMI. Instead, users should use Singularity/Apptainer containers. Please also see the official guidance on how to install new Python packages in the LUMI software guide.

2. Get the data#

Let’s create a Singularity container using cotainr in order to download the data using the owilix command line tool.

2.1. Container for owilix#

First, we will specify the packages to be installed in a conda environment .yml file. Then we will use cotainr to build a new container with the defined packages. In this case we need owilix (which requires Python 3.10 or 3.11) and py4lexis.

Create owilix_env.yml file:

name: owilix_env
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pip=24.0
  - pip:
    - --extra-index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
    - --extra-index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
    - py4lexis
    - owilix

In the terminal of LUMI (note: building the container takes several minutes):

# Get needed modules
module purge
module load LUMI
module load cotainr

# Use cotainr to build the container 
cotainr build owilix_env.sif --system=lumi-g --conda-env=owilix_env.yml --accept-license

## Add required additional bindings
module use /appl/local/containers/ai-modules/
module load singularity-AI-bindings 

# Verify installation
singularity exec owilix_env.sif bash -c 'pip list'

2.2. Use owilix to download the data#

In this example, we will download the latest Finnish data. We’ll open a shell connection to the container and use commands to download the data to the desired directory.

  1. Run a shell within the container:

singularity shell owilix_env.sif

  2. Download the data using owilix (remember to set the target directory!):

owilix --target <target-directory-for-the-data> remote pull all:latest#30 files="**/language=fin/*"

Complete the authentication:

You will be prompted to accept the terms by typing yes. Then copy the web address that appears in the terminal, open it in your browser, and log in to complete the authentication process.

3. Preprocess data#

Now we are ready to preprocess the data using Jupyter. Note that the preprocessing demonstrated in this tutorial is minimal. You should consider more thorough preprocessing for your specific use case!
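As one illustration of what more thorough preprocessing could look like, the sketch below drops lines that repeat across many documents - a common way to strip navigation menus, cookie banners, and footers. The `drop_frequent_lines` helper, the `max_share` threshold, and the sample documents are all invented for this example and are not part of this tutorial's pipeline:

```python
from collections import Counter

def drop_frequent_lines(docs, max_share=0.5):
    """Remove lines that occur in more than max_share of all documents.

    Lines repeated across many pages are usually menus, cookie banners,
    or footers rather than real content.
    """
    line_doc_counts = Counter()
    for doc in docs:
        for line in set(doc.split("\n")):  # count each line once per document
            line_doc_counts[line] += 1

    threshold = max_share * len(docs)
    return [
        "\n".join(l for l in doc.split("\n") if line_doc_counts[l] <= threshold)
        for doc in docs
    ]

docs = [
    "Hyväksy evästeet\nEnsimmäinen artikkeli alkaa tästä.",
    "Hyväksy evästeet\nToinen artikkeli on eri aiheesta.",
    "Hyväksy evästeet\nKolmas artikkeli päättää sarjan.",
]
print(drop_frequent_lines(docs))
```

Here every document shares the cookie-banner line, so it exceeds the 50 % threshold and is removed, while the unique article lines survive.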

3.1. Activate a Jupyter session#

We’ll use the Jupyter environment provided by LUMI:

    1. Navigate to Apps -> Jupyter

    2. Configure the session with the following settings:

    • Project: project_XXXXXX

    • Partition: small

    • Number of CPU cores: 64

    • Memory (GiB): 128

    • Working directory: Select from the dropdown

    • Python: pytorch

    3. Wait for your session to be ready, then click Connect to Jupyter

Once connected, create a notebook and proceed with the preprocessing steps.

3.2. Combine all data into a DataFrame#

Next, we’ll load and combine all downloaded data files into a single pandas DataFrame. After all the preprocessing steps, we’ll save the result as a .parquet file.

import pandas as pd
import os
import glob

path_to_owilix_data = '<target-directory-for-the-data-from-owilix>'

parquet_files = glob.glob(os.path.join(path_to_owilix_data, "**/*.parquet"), recursive=True)

dataframes = []
for file in parquet_files:
    try:
        df = pd.read_parquet(file)
        dataframes.append(df)
    except Exception as e:
        print(f"Error reading {file}: {e}")

# Combine all DataFrames
combined_df = pd.concat(dataframes, ignore_index=True)
print(f"Combined DataFrame shape: {combined_df.shape}")
Combined DataFrame shape: (1256577, 43)
combined_df.columns 
Index(['id', 'record_id', 'title', 'main_content', 'json-ld', 'microdata',
       'opengraph', 'warc_date', 'warc_ip', 'url', 'url_scheme', 'url_path',
       'url_params', 'url_query', 'url_fragment', 'url_subdomain',
       'url_domain', 'url_suffix', 'url_is_private', 'mime_type', 'charset',
       'content_type_other', 'http_server', 'valid', 'warc_file',
       'warc_offset', 'schema_metadata', 'ows_canonical', 'ows_resource_type',
       'ows_curlielabel', 'ows_index', 'ows_genai', 'ows_genai_details',
       'ows_fetch_response_time', 'ows_fetch_num_errors', 'outgoing_links',
       'image_links', 'video_links', 'iframes', 'curlielabels',
       'curlielabels_en', 'address', 'plain_text'],
      dtype='object')

3.3. Filter the content#

In this step, we will combine data from both the plain_text and main_content columns, as the column was renamed in Schema version 0.2.X. For more information about the columns, see the Preprocessing Pipeline documentation.

| column | description | Schema version |
| --- | --- | --- |
| plain_text | Cleaned text from the HTML | 0.1.X |
| main_content | Main content of the HTML, formatted with minimal HTML tags (h1-6, p, ul/ol/li, pre, and a tags) | 0.2.X |

We will then proceed with the following steps:

  1. Filter rows where ows_genai==True

  2. Remove duplicates based on main_content and url

  3. Filter and clean the main_content

  4. Drop duplicates again after cleaning

  5. Filter by word count

  6. Double-check the language with langdetect
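On a toy DataFrame, the filtering steps above can be sketched as follows (the column names match the OWI schema used in this tutorial, but the data itself is invented, and the HTML cleaning of step 3 is elided):

```python
import pandas as pd

# Invented sample rows with the schema columns the filters use
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "url": ["u1", "u2", "u2", "u3", "u4"],
    "main_content": ["long text " * 20, "long text " * 20, "other " * 20, "short", None],
    "plain_text": [None, None, None, None, "fallback " * 20],
    "ows_genai": [True, True, True, True, False],
})

# Combine the content fields, then step 1: keep GenAI-suitable rows
df["main_content"] = df["main_content"].fillna(df["plain_text"])
df = df[df["main_content"].notna()]
df = df[df["ows_genai"] == True]

# Step 2: remove duplicates by content, then by URL (drops row B)
df = df.drop_duplicates(subset="main_content").drop_duplicates(subset="url")

# Steps 3-4 (HTML cleaning and a second dedup) would run here

# Step 5: filter by word count (drops rows C and D)
df["word_count"] = df["main_content"].str.split().str.len()
df = df[df["word_count"] > 30]

print(df[["title", "word_count"]])  # only row A survives
```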

3.3.1. Use ows_genai and drop duplicates#

Prepare the downloaded OWI data for training by:

  • Combining content fields and removing empty entries

  • Filtering for GenAI-suitable content (ows_genai = True)

  • Removing duplicates by content and URL

  • Selecting final columns: title, url, main_content

Progress is tracked by printing dataset shape after each step.

print(f"DataFrame shape before first steps: {combined_df.shape}")

# Fill missing values in 'main_content' with values from 'plain_text'
combined_df['main_content'] = combined_df['main_content'].fillna(combined_df['plain_text']) 

# Drop rows where 'main_content' is still missing and remove the now-unneeded 'plain_text' column
combined_df = combined_df[combined_df["main_content"].notna()].drop(columns=["plain_text"])
print(f"DataFrame shape after combining main_content and plain_text: {combined_df.shape}")

# Keep only rows where 'ows_genai' is True
combined_df = combined_df[combined_df['ows_genai'] == True]
print(f"DataFrame shape after ows_genai: {combined_df.shape}")

# Remove duplicate rows based on 'main_content', then remove duplicates based on 'url'
combined_df= combined_df.drop_duplicates(subset='main_content') # .drop_duplicates(subset='url')
print(f"DataFrame shape after dropping dups (main_content): {combined_df.shape}")
combined_df= combined_df.drop_duplicates(subset='url')
print(f"DataFrame shape after dropping dups (url): {combined_df.shape}")

# Select only the relevant columns for further processing
combined_df = combined_df[['title','url','main_content']]
print(f"DataFrame shape after all steps: {combined_df.shape}")
DataFrame shape before first steps: (1256577, 43)
DataFrame shape after combining main_content and plain_text: (1256577, 42)
DataFrame shape after ows_genai: (1248081, 42)
DataFrame shape after dropping dups (main_content): (466222, 42)
DataFrame shape after dropping dups (url): (220210, 42)
DataFrame shape after all steps: (220210, 3)
combined_df.head(3)
title url main_content
0 forum.bomber.fi - Omat asetukset - Käyttöehdot https://www.bomber.fi/forums/user/terms?sid=cd... <h2>forum.bomber.fi - Käyttöehdot</h2>\n\n<p>K...
1 Yhteystiedot - Mustasaaren seurakuntayhtymä https://www.mustasaarenseurakuntayhtyma.fi/yht... <h1>Yhteystiedot</h1>\n\n<p> </p>\n\n<h4>Musta...
2 VAELLUSNET - Vaellusturinat II - Omat asetukse... http://www.vaellusnet.com/ucp.php?mode=terms&s... <h2>VAELLUSNET - Vaellusturinat II - Käyttöehd...

3.3.2. Filter HTML content#

This code performs minimal cleaning of the main_content field. You can define terms (like policy-related keywords) in POLICY_TERMS to exclude pages entirely.

The function performs the following:

  • Removes <a> tags but keeps the inner text

  • Replaces block-level HTML tags (<p>, <h1>–<h6>, etc.) and <br> with newlines

  • Cleans up HTML entities and removes bullet symbols

  • Filters out short or incomplete lines (e.g. no punctuation, too few words)

  • Normalizes whitespace and joins the cleaned lines into a final text block

  • Returns None if no meaningful content remains

import re
import html

# Terms to exclude early (e.g., policy pages)
POLICY_TERMS = ["käyttöeh"]

# Precompiled regex patterns
A_TAG = re.compile(r'<a\b[^>]*?>(.*?)</a>', flags=re.IGNORECASE | re.DOTALL)
BLOCK_TAGS = re.compile(r'</?(h[1-6]|p|pre|ul|ol|li|div)>', flags=re.IGNORECASE)
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)
TAG_CLEANER = re.compile(r'<[^>]+>')  # fallback to remove leftover tags

TERMINAL_PUNCT_PATTERN = re.compile(r'[.!?]\s*$')
WHITESPACE_PATTERNS = {
    'multiple_newlines': re.compile(r"\n{3,}"),
    'spaces': re.compile(r"[ \t]+"),
    'trailing_spaces': re.compile(r" +\n")
}

def clean_html_min(html_str: str):
    if not html_str or not html_str.strip():
        return None

    # Early policy term check
    html_lower = html_str.lower()
    if any(term in html_lower for term in POLICY_TERMS):
        return None

    # Unwrap <a> tags but keep inner text
    html_str = A_TAG.sub(r'\1', html_str)

    # Replace <br> and block-level tags with newlines
    html_str = BR_TAG.sub('\n', html_str)
    html_str = BLOCK_TAGS.sub('\n', html_str)

    # Remove all remaining tags (non-block level)
    html_str = TAG_CLEANER.sub('', html_str)

    # Decode HTML entities (e.g. &quot; → ", &nbsp; → space)
    html_str = html.unescape(html_str)
    html_str = html_str.replace('\xa0', ' ')  # additional non-breaking space cleanup

    # Remove common bullet symbols
    html_str = re.sub(r'[•◦\u2022]', '', html_str)

    # Normalize and filter lines
    lines = [line.strip() for line in html_str.split('\n') if line.strip()]
    cleaned_lines = []

    for line in lines:
        # Must end in terminal punctuation
        if not TERMINAL_PUNCT_PATTERN.search(line):
            continue

        # Must be long enough
        if len(line) < 20 or len(line.split()) < 4:
            continue

        cleaned_lines.append(line)

    if not cleaned_lines:
        return None

    # Join and normalize whitespace
    cleaned_text = '\n'.join(cleaned_lines)
    cleaned_text = WHITESPACE_PATTERNS['multiple_newlines'].sub("\n\n", cleaned_text)
    cleaned_text = WHITESPACE_PATTERNS['spaces'].sub(" ", cleaned_text)
    cleaned_text = WHITESPACE_PATTERNS['trailing_spaces'].sub("\n", cleaned_text)

    return cleaned_text.strip() if cleaned_text.strip() else None

3.3.3. Example of a site before preprocessing#

test = combined_df['main_content'].iloc[10]
print(test)
<a href="#bodyContent">Siirry sisältöön</a>

<h1>Kae Araki</h1>

Wikipediasta

<p>Kae Araki (<a href="/wiki/Japanin_kieli">jap.</a> 荒木香恵, oikealta nimeltään Kae Abe, s. <a href="/wiki/6._marraskuuta">6. marraskuuta</a> <a href="/wiki/1966">1966</a> <a href="/wiki/Osaka">Osaka</a>) on <a href="/wiki/Japani">japanilainen</a> <a href="/wiki/Seiy%C5%AB">ääninäyttelijä</a>, <a href="/wiki/Seiy%C5%AB">seiyū</a>, joka on näytellyt monissa <a href="/wiki/Anime">anime</a>- ja <a href="/wiki/Televisio">televisiosarjoissa</a>, muun muassa <a href="/wiki/Babar">Babar</a>, <a href="/wiki/Cardcaptor_Sakura">Cardcaptor Sakura</a>, <a href="/wiki/Digimon">Digimon</a>, <a href="/w/index.php?title=Fushigi_y%C5%ABgi&amp;action=edit&amp;redlink=1">Fushigi yūgi</a>, <a href="/wiki/Great_Teacher_Onizuka">Great Teacher Onizuka</a>, <a href="/wiki/Kodomo_no_omocha">Kodomo no omocha</a>, <a href="/w/index.php?title=Wakakusa_monogatari_%E2%80%93_Nan_to_Jo_no_sensei&amp;action=edit&amp;redlink=1">Wakakusa monogatari – Nan to Jo no sensei</a> ja <a href="/wiki/Pok%C3%A9mon">Pokémon</a>. Animesarjojen lisäksi hän on esiintynyt monissa peleissä. </p>

<h2>Aiheesta muualla</h2>

[<a href="/w/index.php?title=Kae_Araki&amp;veaction=edit&amp;section=1">muokkaa</a> | <a href="/w/index.php?title=Kae_Araki&amp;action=edit&amp;section=1">muokkaa wikitekstiä</a>]
<ul>
  <li><a href="https://www.imdb.com/name/nm0032890/">Kae Araki</a> Internet Movie Databasessa. (englanniksi)</li>
</ul>
Tämä <a href="/wiki/N%C3%A4yttelij%C3%A4">näyttelijään</a> liittyvä artikkeli on <a href="/wiki/Wikipedia:Tynk%C3%A4">tynkä</a>. Voit auttaa Wikipediaa <a href="https://fi.wikipedia.org/w/index.php?title=Kae_Araki&amp;veaction=edit">laajentamalla</a> artikkelia.<br>

3.3.4. Example of the site after preprocessing#

This short example illustrates how the HTML cleaning code works.

res = clean_html_min(test)
print(res)
Kae Araki (jap. 荒木香恵, oikealta nimeltään Kae Abe, s. 6. marraskuuta 1966 Osaka) on japanilainen ääninäyttelijä, seiyū, joka on näytellyt monissa anime- ja televisiosarjoissa, muun muassa Babar, Cardcaptor Sakura, Digimon, Fushigi yūgi, Great Teacher Onizuka, Kodomo no omocha, Wakakusa monogatari – Nan to Jo no sensei ja Pokémon. Animesarjojen lisäksi hän on esiintynyt monissa peleissä.
Tämä näyttelijään liittyvä artikkeli on tynkä. Voit auttaa Wikipediaa laajentamalla artikkelia.
# Apply the cleaning function with progress tracking:
from tqdm import tqdm
tqdm.pandas(desc="Cleaning HTML content")
combined_df['cleaned_html_content'] = combined_df['main_content'].progress_map(clean_html_min)
combined_df
Cleaning HTML content: 100%|██████████| 220210/220210 [00:53<00:00, 4142.45it/s] 
title url main_content cleaned_html_content
0 forum.bomber.fi - Omat asetukset - Käyttöehdot https://www.bomber.fi/forums/user/terms?sid=cd... <h2>forum.bomber.fi - Käyttöehdot</h2>\n\n<p>K... None
1 Yhteystiedot - Mustasaaren seurakuntayhtymä https://www.mustasaarenseurakuntayhtyma.fi/yht... <h1>Yhteystiedot</h1>\n\n<p> </p>\n\n<h4>Musta... None
2 VAELLUSNET - Vaellusturinat II - Omat asetukse... http://www.vaellusnet.com/ucp.php?mode=terms&s... <h2>VAELLUSNET - Vaellusturinat II - Käyttöehd... None
3 Gives me some privacy | Dekottaa http://www.dekottaa.com/2014/01/gives-me-some-... <h2>26.1.2014</h2>\n\n<a href="">\n\n<h1> Give... Liitutaulutarra kanan muodossa. Jos ei halua j...
4 Suomen Briard ry - Lähetä sähköpostia http://www.suomenbriard.net/phpBB/memberlist.p... <h2>Yhteystiedot käyttäjälle</h2>\n\nYlläpitäj... Tämä viesti lähetetään pelkkänä tekstinä. Älä ...
... ... ... ... ...
1091706 Vastauspalvelu https://vastauspalvelu.omataloyhtio.fi/ <a href="https://jurinet.fi/">Jurinet</a>\nKuv... Taloyhtiömme on asennettu uusi juuri ilmanpois...
1091814 Sound Particles Studio-ohjelmistot - Pikalatau... https://www.muziker.fi/sound-particles-studio-... <p> Valitse maa, johon lähetys toimitetaan </p... None
1091907 Lattialämmityskaapelit - Hammarin Sähkö Oy https://www.hammarinsahko.fi/sahkotarvikkeet/l... Luotettavaa kauppaa yli 110 vuotta\n\n<h2>Latt... Lattialämmityskaapelit varaavaan lattialämmity...
1091934 Kotitalousvähennyslaskuri 2025: Laske kotitalo... https://vertaakorkoja.fi/kotitalousvahennyslas... <h1>Kotitalousvähennyslaskuri</h1>\n\n<p>Kotit... Kotitalousvähennyslaskurin avulla voit laskea ...
1092309 Ota meihin yhteyttä – Mothersusurrus.com https://mothersusurrus.com/ota-meihin-yhteytta/ <h1>Ota meihin yhteyttä</h1>\n\n<h4>Mikäli sin... Mikäli sinulla on kysyttävää musiikista, tai h...

220210 rows × 4 columns

3.3.5. Drop duplicates and None values#

print(f"Df shape before: {combined_df.shape}")
combined_df = combined_df.drop_duplicates(subset='cleaned_html_content')
combined_df = combined_df.dropna(subset=['cleaned_html_content'])

print(f"Df shape after: {combined_df.shape}")
Df shape before: (220210, 4)
Df shape after: (139074, 4)

3.3.6. Filter by word count#

Next, we calculate the word count for each content entry and filter out any entries with fewer than 30 words.

# Calculate word count for each entry
combined_df['word_count'] = combined_df['cleaned_html_content'].str.split().str.len()

# Sort by word count and reset the index
combined_df = combined_df.sort_values(by='word_count').reset_index(drop=True)
combined_df.head(3)
title url main_content cleaned_html_content word_count
0 Kyky – Welcome https://kyky.today/ Kyky Kyky\n • Ota yhteyttä\n • Rekisteröidy\... Ostaja maksaa sinulle suoraan! 4
1 Gluteeniton ruoka - Upbeat Intl. Trading Oy https://www.east-asia-mart.fi/fi/tuoteryhma/23... |\n • e-Lahjakortit ja Onnenkassit (Fukubukur... 300 g Laatikko, Singapore. 4
2 Tietoja sivusta ”C. S. Lewis” – ApoWiki https://apowiki.fi/index.php?action=info&title... Anonyymi\n\nEt ole kirjautunut\n\n • Keskuste... Katso tämän sivun suojausloki. 4
combined_df.tail(3)
title url main_content cleaned_html_content word_count
139071 Ortodoksinen oppi pelastuksesta – Tsasounan su... https://www.tsasouna.net/FI/2024/09/07/ortodok... Skip to content\nTsasounan suunnalta\n\n • Or... Q & A – kysy papilta!\nQ & A – Mikä ja miksi?\... 69952
139072 Vuosikirja 2021 - Cockerspanielit ry https://cockerspanielit.org/vuosikirja-2022-2/ <h1>Vuosikirja 2021</h1>\n\n<p>Koostanut Pirjo... Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ... 73949
139073 vierailija, tekijä sivustolla Hiiltä ja timanttia https://blogit.metropolia.fi/hiilta-ja-timantt... Hyppää sisältöön\nMetropolian Blogit\n • Uusi... Verkko-opetus on tullut jäädäkseen, mutta mite... 85067
combined_df['word_count'].describe()
count    139074.000000
mean        339.775587
std         962.370831
min           4.000000
25%          52.000000
50%         148.000000
75%         345.000000
max       85067.000000
Name: word_count, dtype: float64
print(f"Df shape before: {combined_df.shape}")
combined_df = combined_df[combined_df['word_count'] > 30]
print(f"Df shape after: {combined_df.shape}")
Df shape before.: (139074, 5)
Df shape after: (117133, 5)

3.3.7. Detect language#

Let’s use the langdetect library to double-check the language of each entry; this can take a few minutes. We then filter the dataset to keep only Finnish-language entries:

from langdetect import detect, LangDetectException

def detect_language_or_none(text):
    try:
        return detect(text)
    except LangDetectException:
        return None
combined_df['language_detected'] = combined_df['cleaned_html_content'].map(detect_language_or_none)

combined_df
title url main_content cleaned_html_content word_count language_detected
21941 4.12.2024 -Työturvallisuuskoulutus - CadSa https://cadsa.fi/koulutuskalenteri/tyoturvalli... <h1>4.12.2024 -Työturvallisuuskoulutus</h1>\n\... Työturvallisuuskoulutus on työturvakeskuksen k... 31 fi
21942 Huulipuna unohtu https://huulipunaunohtu.blogspot.com/ Siirry pääsisältöön\n\nHuulipuna unohtu\n\nÄit... Äiti on pitänyt meistä huolta, nyt me pidämme ... 31 fi
21943 maa-artisokkapikkelsi | Olemme puutarhassa http://olemmepuutarhassa.fi/tag/maa-artisokkap... maa-artisokkapikkelsi | Olemme puutarhassa\n\n... Heti kun maa on sulanut voi esiin kaivaa viime... 31 fi
21944 REIDEN LOITONTAJALAITE | Ironfit Store https://store.ironfit.fi/product/265/ironfit-r... <p>IRONFIT REIDEN LOITONTAJALAITE ST-6007</p>\... Tilattavissa. Toimitusaika 21 päivää.\nTilatta... 31 fi
21945 Työpenkki Henning, levyn leveys 1500 mm, hylly... https://www.gerdmans.fi/varasto-ja-teollisuus/... <h1> Työpenkki Henning, levyn leveys 1500 mm, ... Työpenkki Henning, levyn leveys 1500 mm, hylly... 31 fi
... ... ... ... ... ... ...
139069 Sanatarkat istuntoselostukset - Keskiviikko 20... https://www.europarl.europa.eu/doceo/document/... \nTakaisin Europarl-portaaliin\n\nChoisissez ... Der Präsident. – Bevor wir zum Tätigkeitsprogr... 63237 de
139070 SKVR https://aineistot.finlit.fi/exist/apps/skvr/ru... Esittely Runoluettelo / Metatietosuodatus Runo... Tällä sivulla voit selata runotyyppejä ja luke... 69370 fi
139071 Ortodoksinen oppi pelastuksesta – Tsasounan su... https://www.tsasouna.net/FI/2024/09/07/ortodok... Skip to content\nTsasounan suunnalta\n\n • Or... Q & A – kysy papilta!\nQ & A – Mikä ja miksi?\... 69952 fi
139072 Vuosikirja 2021 - Cockerspanielit ry https://cockerspanielit.org/vuosikirja-2022-2/ <h1>Vuosikirja 2021</h1>\n\n<p>Koostanut Pirjo... Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ... 73949 fi
139073 vierailija, tekijä sivustolla Hiiltä ja timanttia https://blogit.metropolia.fi/hiilta-ja-timantt... Hyppää sisältöön\nMetropolian Blogit\n • Uusi... Verkko-opetus on tullut jäädäkseen, mutta mite... 85067 fi

117133 rows × 6 columns

combined_df['language_detected'].value_counts()
language_detected
fi    103429
en      9031
sv       999
id       497
de       358
it       353
pl       284
hr       255
et       243
nl       233
fr       200
lt       197
es       140
sl       119
ca       100
da        87
tr        71
cs        70
pt        69
no        56
ro        55
lv        51
ru        46
sk        41
hu        26
mk        24
vi        18
tl        14
sq        14
ko         8
sw         8
ar         6
uk         4
hi         4
bn         4
el         4
te         2
bg         2
cy         2
fa         2
af         2
he         2
ne         2
so         1
Name: count, dtype: int64
# Retain only rows detected as Finnish
print(f"Df shape before.: {combined_df.shape}")
combined_df = combined_df[combined_df['language_detected'] == 'fi']
print(f"Df shape after: {combined_df.shape}")
Df shape before.: (117133, 6)
Df shape after: (103429, 6)

3.4. Save the data to a parquet file#

Now that the data is cleaned, we’re ready to save it. We’ll select only the necessary columns before saving.

# drop the unnecessary columns
combined_df = combined_df[['title', 'cleaned_html_content']]
combined_df.head(2)
title cleaned_html_content
21941 4.12.2024 -Työturvallisuuskoulutus - CadSa Työturvallisuuskoulutus on työturvakeskuksen k...
21942 Huulipuna unohtu Äiti on pitänyt meistä huolta, nyt me pidämme ...
## Save the new dataframe with detected Finnish language
path_to_save_the_data = '<path-here-ending-to-parquet-file-name>' # /scratch is recommended for data files
combined_df.to_parquet(path_to_save_the_data, index=False)

After saving data to a parquet file, you may choose to exit the Jupyter environment or continue working within it to create the upcoming Python and batch job scripts while the session remains active.

4. Finetune the model#

In this step, we’ll train the model using a batch job and a Python script. You can create and edit these files either via the LUMI web interface or by using Visual Studio Code’s Remote SSH extension (for more details, see the documentation here).

The training scripts used in this tutorial are based on the CSCfi/llm-fine-tuning-examples repository.

We will also use MLflow to track training metrics. For a practical example, see the tutorial on using MLflow in Puhti and LUMI.

You can create the necessary files under your project directory, e.g.:

/project/project_46XXXXXXXX/${USER}

Using Llama models through transformers
If you want to use Llama models via the transformers library, follow these steps:

  1. Create Hugging Face account

  2. Locate the LLaMA models, read and accept their terms of use, and wait for approval

  3. Generate an access token on your Hugging Face account

  4. Set the access token in your HF cache directory (HF_HOME), for example:

export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache
mkdir -p $HF_HOME

cd $HF_HOME
echo "<token-here>" > token

4.0. Container for training#

First we need to create a compatible environment for training.

This example shows how to use cotainr to build a container with PyTorch configured for LUMI’s AMD GPUs. We’ll follow this approach to create our training container.

Create training_env.yml file:

name: training_env
channels:
  - conda-forge
dependencies:
  - filelock=3.13.1
  - fsspec=2024.2.0
  - jinja2=3.1.3
  - markupsafe=2.1.5
  - mpmath=1.3.0
  - networkx=3.2.1
  - numpy=1.26.3
  - pillow=10.2.0
  - pip=24.0
  - python=3.11.7
  - sympy=1.12
  - typing-extensions=4.9.0
  - pip:
    - --extra-index-url https://download.pytorch.org/whl/
    - pytorch-triton-rocm==2.3.1
    - torch==2.3.1+rocm6.0
    - torchaudio==2.3.1+rocm6.0
    - torchvision==0.18.1+rocm6.0
    - langchain==0.3.27
    - mlflow==2.22.0
    - datasets==4.0.0
    - peft==0.17.0
    - transformers==4.55.0

In the terminal of LUMI (note: building the container takes several minutes):

# Get needed modules
module purge
module load LUMI
module load cotainr

# Use cotainr to build the container 
cotainr build training_env.sif --system=lumi-g --conda-env=training_env.yml --accept-license

## Add required additional bindings
module use /appl/local/containers/ai-modules/
module load singularity-AI-bindings 

# Verify installation
singularity exec training_env.sif bash -c 'pip list'

4.1. Python scripts for training the model#

Below are the Python scripts used to finetune the Meta Llama-3.2-1B model. They include:

  • Training data preprocessing using a custom preprocess function that chunks and tokenizes the input text - implemented in preprocessing.py

  • Training setup using Hugging Face’s Trainer class - implemented in train.py

  • Metric tracking with MLflow - see train.py

  • Model saving and checkpointing - see train.py

4.1.1. Python script: preprocessing.py#

This script handles text splitting using LangChain’s RecursiveCharacterTextSplitter. It breaks long text inputs into smaller overlapping chunks, optionally appending an end-of-sequence token to each chunk. The script also includes a preprocessing function that tokenizes these chunks with a Hugging Face tokenizer.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_text(text, chunk_size, overlap_size, eos_token):
    """Splits a single large text into smaller overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap_size,
    )

    chunks = splitter.split_text(text)
    if eos_token:
        chunks = [chunk + f" {eos_token}" for chunk in chunks]

    return chunks
    
    

def preprocess(examples, tokenizer, max_tokens=4096, chunk_size=8192, overlap_size=200):
    """Preprocesses a batch of examples by splitting textcontent into chunks and tokenizing them."""
    all_chunks = []
    for text in examples["cleaned_html_content"]:
        chunks = chunk_text(text, chunk_size=chunk_size, overlap_size=overlap_size, eos_token=tokenizer.eos_token)
        all_chunks.extend(chunks)

    tokenized_output = tokenizer(
        all_chunks,
        padding=False, 
        truncation=True,
        max_length=max_tokens,  
        add_special_tokens=True,
        return_length=False, 
    )

    return {
        "input_ids": tokenized_output["input_ids"],
        "attention_mask": tokenized_output["attention_mask"]
    }
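To see what overlapping chunking produces without pulling in LangChain, here is a minimal pure-Python stand-in. Note this naive fixed-width splitter is not what RecursiveCharacterTextSplitter actually does - the real splitter prefers to break at paragraph, sentence, and word boundaries - but it illustrates how chunk_size and overlap_size interact:

```python
def naive_chunks(text, chunk_size, overlap_size):
    """Split text into fixed-size character windows; consecutive
    windows share overlap_size characters."""
    step = chunk_size - overlap_size
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap_size, 1), step)]

text = "abcdefghij" * 3  # 30 characters
chunks = naive_chunks(text, chunk_size=12, overlap_size=4)
# Each chunk starts 8 characters after the previous one, so the last
# 4 characters of one chunk repeat at the start of the next.
print(chunks)
```

Overlap gives the model a little shared context between neighbouring training chunks, at the cost of some duplicated tokens.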

4.1.2. Python script: train.py#

This is the main Python script used to finetune the meta-llama/Llama-3.2-1B model. It uses MLflow to track training metrics, which are saved in the mlruns folder inside the specified --output-path.

Remember: Make sure to provide the correct path to your training data in Parquet format via the --parquet-file argument, either here or in your batch job script.

Note! By default, the script runs a small test training using only 1,000 samples. To train on the full dataset, comment out these lines and adjust the training parameters accordingly:

    # comment these if you would like to use the whole dataset
    tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))
    tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))

train.py

import argparse
import os
import sys
import time
import mlflow

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

from functools import partial
from preprocessing import preprocess

from datasets.utils.logging import disable_progress_bar

if __name__ == "__main__":
    #disable_progress_bar()  #  Disable progress bar during dataset processing

    parser = argparse.ArgumentParser()  #  set up ArgumentParser 
    parser.add_argument(
        "--input-model",
        type=str,
        default="meta-llama/Llama-3.2-1B",
        help="The pre-trained model from Hugging Face to use as basis: https://huggingface.co/models",
    )
    parser.add_argument(
        "--output-path",
        type=str,
        help="Directory where model checkpoints and outputs will be saved.",
    )
    parser.add_argument(
        "--parquet-file",
        type=str,
        #default='<path-to-training-data-in-one-parquet-file>',
        help="Path to the input Parquet file containing training data.",
    )
    parser.add_argument(
        "--model_output_name",
        type=str, 
        help="Name for the finetuned model to be saved under.",
    )
    parser.add_argument("--batch_size", "-b", type=int, default=1, help="Training batch size")
    parser.add_argument(
        "--num-workers",
        type=int,
        default=1,
        help="The number of CPU worker processes to use.",
    )
    parser.add_argument(
        "--resume",
        default=False,
        action="store_true",
        help="If set, continue from a previously interrupted run. Otherwise, overwrite existing checkpoints.",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=400,
        help="The number of training steps.",
    )
    parser.add_argument("--peft", action="store_true", help="Use PEFT: https://huggingface.co/docs/peft/index")
    parser.add_argument(
        "--4bit",
        dest="bnb_4bit",
        action="store_true",
        help="Use 4bit quantization with bitsandbytes: https://huggingface.co/docs/bitsandbytes/main/en/index",
    )
    args, _ = parser.parse_known_args()

    # Check for required arguments
    if not args.model_output_name:
        print("ERROR: --model_output_name must be specified.")
        sys.exit(1)

    # Read the environment variables provided by torchrun
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])


    # Initialize MLflow only on the main process (rank 0) to prevent multi-process conflicts
    if rank == 0:
        # Set the MLflow tracking URI to save logs and artifacts under the specified output directory
        mlflow_tracking_uri = os.path.join(args.output_path, "mlruns")
        mlflow.set_tracking_uri(mlflow_tracking_uri)

        # Use the model output name as the MLflow experiment name
        mlflow.set_experiment(args.model_output_name)
        print(f"MLflow tracking URI: {mlflow_tracking_uri}")
    

    # this is where trained model and checkpoints will go
    output_model_dir = os.path.join(args.output_path, args.model_output_name)
    
    if rank == 0:
        print("Using PyTorch version:", torch.__version__)
        print(f"Using {world_size} GPUs in total, {local_world_size} on this node.")
        print(f"Number of GPUs visible to this process: {torch.cuda.device_count()}")
        print(f"Rank: {rank}")

    # Then we determine the device on which to train the model.
    if torch.cuda.is_available():
        device = torch.device("cuda", local_rank)
        print(f"Using GPU {local_rank}, device name: {torch.cuda.get_device_name(device)}")
    else:
        print(f"No GPU found, using CPU instead. (Rank: {local_rank})")
        device = torch.device("cpu")

    if args.batch_size % world_size != 0:
        if rank == 0:
            print(f"ERROR: batch_size={args.batch_size} has to be a multiple of the number of GPUs={world_size}!")
        sys.exit(1)  # exit on every rank so the distributed job does not hang


    if rank == 0:
        print(f" output_model_dir: {output_model_dir}")
    
    start = time.time()

    if rank == 0:
        print("Loading input model and tokenizer")

    tokenizer = AutoTokenizer.from_pretrained(args.input_model, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    special_tokens = tokenizer.special_tokens_map

    quantization_config = None
    if args.bnb_4bit:
        from transformers import BitsAndBytesConfig

        print("Using bnb_4bit")
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.bfloat16,
        )
        quantization_config = bnb_config

    model = AutoModelForCausalLM.from_pretrained(
        args.input_model,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,
        device_map=device,
    )

    if args.peft:
        # peft_config = LoraConfig(
        #     task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32,
        #     lora_dropout=0.1
        # )
        # LoRA config from here:
        # https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py#L128
        peft_config = LoraConfig(
            lora_alpha=8,
            lora_dropout=0.05,
            r=16,
            bias="none",
            target_modules="all-linear",
            task_type="CAUSAL_LM",
            # modules_to_save = ["lm_head", "embed_tokens"] # add if you want to use the Llama 3 instruct template
        )
        model = get_peft_model(model, peft_config)
        print("Using PEFT")
        model.print_trainable_parameters()

    stop = time.time()
    if rank == 0:
        print(f"Loading model and tokenizer took: {stop - start:.2f} seconds")
    
    train_batch_size = args.batch_size
    eval_batch_size = args.batch_size

    if rank == 0:
        print(f"Global train and eval batch size : {args.batch_size}")


    training_args = TrainingArguments(
        disable_tqdm=True,
        output_dir=output_model_dir,
        save_strategy="steps",
        save_steps=50,  # MODIFY for real training, e.g. 50 -> 400
        save_total_limit=3,
        learning_rate=2e-5,
        weight_decay=0.01,
        bf16=True,  # use bfloat16 mixed precision
        per_device_train_batch_size=train_batch_size // world_size,
        per_device_eval_batch_size=eval_batch_size,
        dataloader_num_workers=args.num_workers,
        ddp_find_unused_parameters=False,
        dataloader_pin_memory=True,
        metric_for_best_model="eval_loss",
        eval_strategy="steps",
        eval_steps=100,  # MODIFY for real training, e.g. 100 -> 200
        num_train_epochs=2,
        max_steps=args.max_steps,  # COMMENT THIS OUT if using a bigger dataset

        # MLflow integration
        report_to=["mlflow"],
        logging_steps=50,  # MODIFY to match your training length
        logging_strategy="steps",

        # Run name for MLflow: includes the SLURM job ID to identify the run
        run_name=f"{args.model_output_name}_{os.environ.get('SLURM_JOB_ID')}",
    )

    #if rank == 0:
        # print(f"Training arguments : {training_args}")

    # Load parquet data
    raw_dataset = load_dataset("parquet", data_files=args.parquet_file)

    # Split dataset into train and validation sets
    split_dataset = raw_dataset["train"].train_test_split(test_size=0.1, seed=42)
    max_tokens = 2048
    overlap_tokens = 50

    if rank == 0:
        print("Dataset columns:", raw_dataset["train"].column_names)
        print(f"Type of column_names: {type(raw_dataset['train'].column_names)}")
    
    column_names = raw_dataset["train"].column_names

    preprocess_function = partial(
        preprocess, tokenizer=tokenizer, max_tokens=max_tokens, chunk_size=8192, overlap_size=overlap_tokens
    )

    tokenized_train_dataset = split_dataset["train"].map(
        preprocess_function,
        batched=True,
        remove_columns=column_names,
        num_proc=args.num_workers,
    )

    tokenized_val_dataset = split_dataset["test"].map(
        preprocess_function,
        batched=True,
        remove_columns=column_names,
        num_proc=args.num_workers,
    )
    ####################################################
    # Comment these two lines out to use the whole dataset
    tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))
    tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))

    # Print the sizes to verify
    if rank == 0:
        print(f"Train dataset size: {len(tokenized_train_dataset)}")
        print(f"Validation dataset size: {len(tokenized_val_dataset)}")

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")

    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    start_train = time.time()
    if rank == 0:
        print("Training starting...")

    # Train the model - MLflow will automatically log metrics
    trainer.train(resume_from_checkpoint=args.resume)

    stop_train = time.time()
    if rank == 0:
        elapsed = stop_train - start_train
        hours = int(elapsed // 3600)
        minutes = int((elapsed % 3600) // 60)
        seconds = int(elapsed % 60)
        print(f"Finetuning model took: {hours}h {minutes}m {seconds}s")

    # Save the model
    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
    trainer.save_model(output_model_dir)
    
    if rank == 0:
        print()
        print("Training done, you can find the final model (and checkpoints) in", output_model_dir)
        print(f"\nMLflow experiment data stored in: {mlflow_tracking_uri}")
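
The `preprocess` function imported from `preprocessing` above is expected to split each document into overlapping token chunks, here at most `max_tokens` (2048) per chunk with `overlap_tokens` (50) shared between consecutive chunks so no context is lost at chunk boundaries. A minimal, hypothetical sketch of such chunking logic (not the tutorial's actual implementation):

```python
def chunk_with_overlap(token_ids, max_tokens=2048, overlap=50):
    """Split one token sequence into chunks of at most max_tokens,
    sharing `overlap` tokens between consecutive chunks."""
    if len(token_ids) <= max_tokens:
        return [token_ids]
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break
    return chunks

# Example: a 100-token document with 40-token chunks and 10-token overlap
chunks = chunk_with_overlap(list(range(100)), max_tokens=40, overlap=10)
print([len(c) for c in chunks])  # [40, 40, 40]
```

Each chunk repeats the last `overlap` tokens of the previous one, so sentences spanning a chunk boundary still appear intact in at least one training example.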

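With `mlm=False`, `DataCollatorForLanguageModeling` prepares causal language modeling batches: it pads each batch to a common length and copies `input_ids` into `labels`, masking padded positions with `-100` so the loss ignores them (the one-token label shift happens inside the model). A pure-Python sketch of that behavior, for intuition only:

```python
def causal_lm_collate(batch, pad_token_id, ignore_index=-100):
    """Pad a batch of token-ID lists to equal length and build labels
    for causal LM training: labels copy input_ids, with padding
    positions replaced by ignore_index so the loss skips them."""
    max_len = max(len(ids) for ids in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in batch:
        pad = max_len - len(ids)
        out["input_ids"].append(ids + [pad_token_id] * pad)
        out["attention_mask"].append([1] * len(ids) + [0] * pad)
        out["labels"].append(ids + [ignore_index] * pad)
    return out

batch = causal_lm_collate([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

This is why the script sets `tokenizer.pad_token = tokenizer.eos_token` earlier: Llama models ship without a padding token, and the collator needs one.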
4.2. Batch job script for training with 8GPUs#

To run training on LUMI using 8 GPUs, you need to submit a batch job via a SLURM script. Below is an example script named run_train_8gpu.sh.

This script:

  • Requests resources from the GPU partition (e.g. dev-g or small-g), including 8 GPUs, 56 CPU cores, and 480 GB of memory.

  • Loads the necessary modules for Singularity container support.

  • Sets environment variables for Hugging Face cache and tokenizer behavior.

  • Defines an output directory for saving the trained model and logs.

  • Launches the training inside the container using torchrun with distributed training support.

Remember to replace <number-here> with your project ID, <path-to-training-data-parquet-file> with the actual path to your preprocessed training data, and <path-to-training-container> with the path to your training container (e.g., training_env.sif). Also, consider switching the partition to small-g and adjusting the --time parameter for longer training runs.

run_train_8gpu.sh

#!/bin/bash
#SBATCH --account=project_<number-here>
#SBATCH --partition=dev-g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=56
#SBATCH --mem=480G
#SBATCH --time=00:15:00 
#SBATCH --gpus-per-node=8

module use /appl/local/containers/ai-modules
module load singularity-AI-bindings

# This will store all the Hugging Face cache such as downloaded models
# and datasets in the project's scratch folder
export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache
mkdir -p $HF_HOME

# Path to where the trained model and logging data will go
OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data
mkdir -p $OUTPUT_DIR

TRAINING_DATA_FILE=<path-to-training-data-parquet-file>

# Disable internal parallelism of huggingface's tokenizer since we
# want to retain direct control of parallelism options.
export TOKENIZERS_PARALLELISM=false

set -xv  # print the command so that we can verify setting arguments correctly from the logs

CONTAINER=<path-to-training-container>
     
srun singularity exec $CONTAINER \
    torchrun --standalone \
        --nnodes=1 \
        --nproc-per-node=$SLURM_GPUS_PER_NODE \
        train.py "$@" \
       --output-path $OUTPUT_DIR \
       --parquet-file $TRAINING_DATA_FILE \
       --model_output_name="Llama-3.2-1B-finetuned" \
       --num-workers $SLURM_CPUS_PER_TASK \
       --batch_size=8
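
With --batch_size=8 as the global batch size and 8 GPUs, train.py gives each GPU one sample per step (per_device_train_batch_size = batch_size // world_size), which is why the script rejects batch sizes that are not divisible by the GPU count. The arithmetic as a small sketch:

```python
def per_device_batch(global_batch, world_size):
    """Split a global batch size evenly across GPUs; mirrors the
    divisibility check and division done in train.py."""
    if global_batch % world_size != 0:
        raise ValueError(
            f"batch_size={global_batch} must be a multiple of world_size={world_size}"
        )
    return global_batch // world_size

print(per_device_batch(8, 8))   # 1 sample per GPU per step
print(per_device_batch(16, 8))  # 2 samples per GPU per step
```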

4.3. Run training script#

To train the model on LUMI with 8 GPUs, submit the batch job using the SLURM script provided in run_train_8gpu.sh.

Simply run the following command in the LUMI terminal:

sbatch run_train_8gpu.sh

Once the job starts, SLURM will automatically write its output to a file named slurm-{slurm_job_id}.out in the directory from which you submitted the job.

You can monitor the status of your jobs at any time using sacct (or squeue --me for jobs that are still queued or running).

4.4. Use MLflow to check metrics#

After the training completes, you’ll find the logged data inside the mlruns folder located within your specified output directory. If you did not change this path in run_train_8gpu.sh, the MLflow metrics are located at /scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data/mlruns.

To visualize and monitor your training metrics, you can open an MLflow session via the LUMI web interface. Navigate to Apps -> MLflow.

Set the Location where MLflow files are stored to the full path where your mlruns folder is located. After launching the session, you can interactively browse training metrics, losses and parameters.

5. Test the model#

After finetuning, you can test the model using a Python script and a SLURM batch job. Inference results will be saved to a logging file for review.

To run inference, simply submit the batch job with: sbatch run_inference.sh

This will generate model outputs for your predefined prompts and log them for inspection.

Note! The training container is not needed here, since inference does not require any additional packages beyond what the pytorch/2.5 module provides.

run_inference.sh

#!/bin/bash
#SBATCH --account=project_XXXXXXXXXX
#SBATCH --partition=dev-g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=7
#SBATCH --mem=60G
#SBATCH --time=0:15:00
#SBATCH --gpus-per-node=1

module purge
module use /appl/local/csc/modulefiles/
module load pytorch/2.5

# This will store all the Hugging Face cache such as downloaded models
# and datasets in the project's scratch folder
export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache
mkdir -p $HF_HOME

export LOG_FILE_PATH=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/inference_logs
mkdir -p $LOG_FILE_PATH
export LOG_FILE=${LOG_FILE_PATH}/inference_prints.log

# Path to where the trained model and logging data will go
OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-data
mkdir -p $OUTPUT_DIR

# Disable internal parallelism of huggingface's tokenizer since we
# want to retain direct control of parallelism options.
export TOKENIZERS_PARALLELISM=false

set -xv  # print the command so that we can verify setting arguments correctly from the logs

MODEL_PATH_1="meta-llama/Llama-3.2-1B"
MODEL_PATH_2="</path/to/your/finetuned/model>"

# Define prompts as an array
PROMPTS=(
  "Tekoälyn kehitys muuttaa maailmaa nopeasti ja siksi "
  "Tervetuloa "
)
# Run inference for each model and prompt combination
for MODEL in "$MODEL_PATH_1" "$MODEL_PATH_2"; do
  for PROMPT in "${PROMPTS[@]}"; do
    srun python inference.py \
      --model "$MODEL" \
      --prompt "$PROMPT"
  done
done

inference.py

import logging
import argparse
import torch
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

LOG_FILE = os.environ.get('LOG_FILE')
slurmjob_id = os.environ['SLURM_JOBID']

# logging file settings
logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model",
        type=str,
        help="Path to fine-tuned model directory"
    )
    
    parser.add_argument(
        "--prompt",
        type=str,
        help="Prompt for the LLM to continue"
    )
    args = parser.parse_args()


    logging.info(f"Slurmjob_ID : {slurmjob_id}")
    logging.info(f"Model Path: {args.model}")
    logging.info(f"Prompt: {args.prompt}")

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    print(f"Using device {device}")
    if device.type == 'cuda':
        print(f"Device name is {torch.cuda.get_device_name(device)}")

    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(args.model)
    model.to(device)


    with torch.no_grad():
        inputs = tokenizer(args.prompt, return_tensors='pt').to(device)
        # Note: max_length counts the prompt tokens too; use max_new_tokens
        # instead if you want to limit only the newly generated tokens.
        outputs = model.generate(**inputs, do_sample=True, max_length=200, num_return_sequences=2)
        decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        
        print("Generated Outputs:")
        logging.info("Generated Outputs:")
        for i, text in enumerate(decoded_outputs):
            print(f"\n--- Output {i + 1} ---\n{text}")
            logging.info(f"\n--- Output {i + 1} ---\n{text}")

    logging.info("-" * 40)
    

Thank you for following the tutorial — we hope you found it useful!

For more information on the OpenWebSearch.eu project see: https://openwebsearch.eu/

For more information on the LUMI supercomputer and CSC, see: https://www.lumi-supercomputer.eu/, https://www.csc.fi/