Tutorial 14: Finetuning an LLM with OWI data using the LUMI supercomputer#
This tutorial demonstrates one way to use OWI data to finetune a large language model (LLM) on the LUMI supercomputer. In this example, Finnish-language data is downloaded and used to finetune Meta’s Llama-3.2-1B, improving its performance on Finnish-language tasks.
This tutorial has these steps:
Get started on using LUMI.
Create a Singularity container for data downloading with the owilix command line tool and download the data using this container.
Preprocess the data and prepare it for training using Jupyter notebooks within the Jupyter environment provided by the LUMI web interface.
Create a second Singularity container optimized for training with the necessary Python packages for machine learning.
Create a batch job and write Python scripts for training and inference using the training container.
OWI License
Before using any data, you must review the terms under the license.
The model trained with OWI data may only be used for research purposes.
Disclaimer
Please note that this is a technical guide only and does not constitute a legal assessment of whether or how you may use the data.
This work uses index files as part of the index partition created by the OpenWebSearch.eu project that has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070014 (OpenWebSearch.EU).
1. Get started on using LUMI#
To begin using the LUMI supercomputer, connect via SSH (requires a valid project):
ssh -i <path-to-private-key> <username>@lumi.csc.fi
1.1. Where to store data - disk areas#
Each user has a home directory ($HOME) that can contain up to 20 GB of data. Do not use it for data or code - use /project or /scratch instead. See more about the different disk areas here: https://docs.lumi-supercomputer.eu/storage/#where-to-store-data
1.2. Installing Python packages#
Installing packages directly via pip or conda is not recommended, as it puts a lot of strain on the Lustre file system on LUMI. Instead, use Singularity/Apptainer containers. Please also see the official guidance on how to install new Python packages in the LUMI software guide.
2. Get the data#
Let’s create a Singularity container using cotainr in order to download the data using the owilix command line tool.
2.1. Container for owilix#
First, we will specify the packages to be installed in a conda environment .yml file. Then we will use cotainr to build a new container with the defined packages. In this case we need owilix (requires Python 3.10 or 3.11) and py4lexis.
Create owilix_env.yml file:
name: owilix_env
channels:
- conda-forge
dependencies:
- python=3.11
- pip=24.0
- pip:
- --extra-index-url https://opencode.it4i.eu/api/v4/projects/107/packages/pypi/simple
- --extra-index-url https://opencode.it4i.eu/api/v4/projects/92/packages/pypi/simple
- py4lexis
- owilix
In the LUMI terminal (note: building the container takes several minutes):
# Get needed modules
module purge
module load LUMI
module load cotainr
# Use cotainr to build the container
cotainr build owilix_env.sif --system=lumi-g --conda-env=owilix_env.yml --accept-license
# Add required additional bindings
module use /appl/local/containers/ai-modules/
module load singularity-AI-bindings
# Verify installation
singularity exec owilix_env.sif bash -c 'pip list'
2.2. Use owilix to download the data#
In this example, we will download the latest Finnish data. We’ll open a shell connection to the container and use commands to download the data to the desired directory.
Run a shell within the container:
singularity shell owilix_env.sif
Download the data using owilix (remember to set the target directory!):
owilix --target <target-directory-for-the-data> remote pull all:latest#30 files="**/language=fin/*"
Complete the authentication:
You will be prompted to accept the terms by typing yes. Then copy the web address that appears in the terminal, open it in your browser, and log in to complete the authentication process.
3. Preprocess data#
Now we are ready to preprocess the data using Jupyter. Note that the preprocessing demonstrated in this tutorial is minimal. You should consider more thorough preprocessing for your specific use case!
3.1. Activate a Jupyter session#
We’ll use the Jupyter environment provided by LUMI:
Navigate to Apps -> Jupyter
Configure the session with the following settings:
Project: project_XXXXXX
Partition: small
Number of CPU cores: 64
Memory (GiB): 128
Working directory: Select from the dropdown
Python: pytorch
Wait for your session to be ready, then click Connect to Jupyter.
Once connected, create a notebook and proceed with the preprocessing steps.
3.2. Combine all data to df#
Next, we’ll load and combine all downloaded data files into a single pandas DataFrame. After all the preprocessing steps, we’ll save the result as a .parquet file.
path_to_owilix_data = '<target-directory-for-the-data-from-owilix>'  # directory only; the parquet files are matched below
import pandas as pd
import os
import glob
parquet_files = glob.glob(os.path.join(path_to_owilix_data, "**/*.parquet"), recursive=True)
dataframes = []
for file in parquet_files:
try:
df = pd.read_parquet(file)
dataframes.append(df)
except Exception as e:
print(f"Error reading {file}: {e}")
# Combine all DataFrames
combined_df = pd.concat(dataframes, ignore_index=True)
print(f"Combined DataFrame shape: {combined_df.shape}")
Combined DataFrame shape: (1256577, 43)
combined_df.columns
Index(['id', 'record_id', 'title', 'main_content', 'json-ld', 'microdata',
'opengraph', 'warc_date', 'warc_ip', 'url', 'url_scheme', 'url_path',
'url_params', 'url_query', 'url_fragment', 'url_subdomain',
'url_domain', 'url_suffix', 'url_is_private', 'mime_type', 'charset',
'content_type_other', 'http_server', 'valid', 'warc_file',
'warc_offset', 'schema_metadata', 'ows_canonical', 'ows_resource_type',
'ows_curlielabel', 'ows_index', 'ows_genai', 'ows_genai_details',
'ows_fetch_response_time', 'ows_fetch_num_errors', 'outgoing_links',
'image_links', 'video_links', 'iframes', 'curlielabels',
'curlielabels_en', 'address', 'plain_text'],
dtype='object')
3.3. Filter the content#
In this step, we will combine data from both the plain_text and main_content columns, since this column was renamed in Schema version 0.2.X. For more information about the columns, see the Preprocessing Pipeline documentation.
| column | description | Schema version |
|---|---|---|
| plain_text | Cleaned text from the HTML | 0.1.X |
| main_content | Main content of the HTML, formatted with minimal HTML tags | 0.2.X |
We will then proceed with the following steps:
Filter rows where ows_genai == True
Remove duplicates based on main_content and url
Filter and clean the main_content
Drop duplicates again after cleaning
Filter by word count
Double-check the language with langdetect
3.3.1. Use ows_genai and drop duplicates#
Prepare the downloaded OWI data for training by:
Combining content fields and removing empty entries
Filtering for GenAI-suitable content (ows_genai == True)
Removing duplicates by content and URL
Selecting the final columns: title, url, main_content
Progress is tracked by printing the dataset shape after each step.
print(f"DataFrame shape before first steps: {combined_df.shape}")
# Fill missing values in 'main_content' with values from 'plain_text'
combined_df['main_content'] = combined_df['main_content'].fillna(combined_df['plain_text'])
# Drop rows where 'main_content' is still missing and remove the now-unneeded 'plain_text' column
combined_df = combined_df[combined_df["main_content"].notna()].drop(columns=["plain_text"])
print(f"DataFrame shape after combining main_content and plain_text: {combined_df.shape}")
# Keep only rows where 'ows_genai' is True
combined_df = combined_df[combined_df['ows_genai'] == True]
print(f"DataFrame shape after ows_genai: {combined_df.shape}")
# Remove duplicate rows based on 'main_content', then remove duplicates based on 'url'
combined_df = combined_df.drop_duplicates(subset='main_content')
print(f"DataFrame shape after dropping dups (main_content): {combined_df.shape}")
combined_df = combined_df.drop_duplicates(subset='url')
print(f"DataFrame shape after dropping dups (url): {combined_df.shape}")
# Select only the relevant columns for further processing
combined_df = combined_df[['title','url','main_content']]
print(f"DataFrame shape after all steps: {combined_df.shape}")
DataFrame shape before first steps: (1256577, 43)
DataFrame shape after combining main_content and plain_text: (1256577, 42)
DataFrame shape after ows_genai: (1248081, 42)
DataFrame shape after dropping dups (main_content): (466222, 42)
DataFrame shape after dropping dups (url): (220210, 42)
DataFrame shape after all steps: (220210, 3)
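The sharp drop in row counts above comes from drop_duplicates keeping only the first occurrence per key. A minimal pandas illustration of that behaviour (toy data, not the OWI dataset):

```python
import pandas as pd

# Toy frame: two rows share a url, two rows share main_content
df = pd.DataFrame({
    "url": ["a", "a", "b"],
    "main_content": ["x", "y", "x"],
})

# keep='first' is the default: the first row per duplicate key survives
by_content = df.drop_duplicates(subset="main_content")
by_url = by_content.drop_duplicates(subset="url")
print(len(by_content), len(by_url))  # → 2 1
```

Because the two passes use different keys, each pass can remove rows the other kept, which is why both shapes are printed separately above.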
combined_df.head(3)
| | title | url | main_content |
|---|---|---|---|
| 0 | forum.bomber.fi - Omat asetukset - Käyttöehdot | https://www.bomber.fi/forums/user/terms?sid=cd... | <h2>forum.bomber.fi - Käyttöehdot</h2>\n\n<p>K... |
| 1 | Yhteystiedot - Mustasaaren seurakuntayhtymä | https://www.mustasaarenseurakuntayhtyma.fi/yht... | <h1>Yhteystiedot</h1>\n\n<p> </p>\n\n<h4>Musta... |
| 2 | VAELLUSNET - Vaellusturinat II - Omat asetukse... | http://www.vaellusnet.com/ucp.php?mode=terms&s... | <h2>VAELLUSNET - Vaellusturinat II - Käyttöehd... |
3.3.2. Filter HTML content#
This code performs minimal cleaning of the main_content field. You can define terms (like policy-related keywords) in POLICY_TERMS to exclude pages entirely.
The function performs the following:
Removes <a> tags but keeps the inner text
Replaces block-level HTML tags (<p>, <h1>-<h6>, etc.) and <br> with newlines
Cleans up HTML entities and removes bullet symbols
Filters out short or incomplete lines (e.g. no punctuation, too few words)
Normalizes whitespace and joins the cleaned lines into a final text block
Returns None if no meaningful content remains
import re
import html
# Terms to exclude early (e.g., policy pages)
POLICY_TERMS = ["käyttöeh"]
# Precompiled regex patterns
A_TAG = re.compile(r'<a\b[^>]*?>(.*?)</a>', flags=re.IGNORECASE | re.DOTALL)
BLOCK_TAGS = re.compile(r'</?(h[1-6]|p|pre|ul|ol|li|div)>', flags=re.IGNORECASE)
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)
TAG_CLEANER = re.compile(r'<[^>]+>') # fallback to remove leftover tags
TERMINAL_PUNCT_PATTERN = re.compile(r'[.!?]\s*$')
WHITESPACE_PATTERNS = {
'multiple_newlines': re.compile(r"\n{3,}"),
'spaces': re.compile(r"[ \t]+"),
'trailing_spaces': re.compile(r" +\n")
}
def clean_html_min(html_str: str):
if not html_str or not html_str.strip():
return None
# Early policy term check
html_lower = html_str.lower()
if any(term in html_lower for term in POLICY_TERMS):
return None
# Unwrap <a> tags but keep inner text
html_str = A_TAG.sub(r'\1', html_str)
# Replace <br> and block-level tags with newlines
html_str = BR_TAG.sub('\n', html_str)
html_str = BLOCK_TAGS.sub('\n', html_str)
# Remove all remaining tags (non-block level)
html_str = TAG_CLEANER.sub('', html_str)
# Decode HTML entities (e.g. &quot; → ", &nbsp; → space)
html_str = html.unescape(html_str)
html_str = html_str.replace('\xa0', ' ') # additional non-breaking space cleanup
# Remove common bullet symbols
html_str = re.sub(r'[•◦\u2022]', '', html_str)
# Normalize and filter lines
lines = [line.strip() for line in html_str.split('\n') if line.strip()]
cleaned_lines = []
for line in lines:
# Must end in terminal punctuation
if not TERMINAL_PUNCT_PATTERN.search(line):
continue
# Must be long enough
if len(line) < 20 or len(line.split()) < 4:
continue
cleaned_lines.append(line)
if not cleaned_lines:
return None
# Join and normalize whitespace
cleaned_text = '\n'.join(cleaned_lines)
cleaned_text = WHITESPACE_PATTERNS['multiple_newlines'].sub("\n\n", cleaned_text)
cleaned_text = WHITESPACE_PATTERNS['spaces'].sub(" ", cleaned_text)
cleaned_text = WHITESPACE_PATTERNS['trailing_spaces'].sub("\n", cleaned_text)
return cleaned_text.strip() if cleaned_text.strip() else None
3.3.3. Example of a site before preprocessing#
test = combined_df['main_content'].iloc[10]
print(test)
<a href="#bodyContent">Siirry sisältöön</a>
<h1>Kae Araki</h1>
Wikipediasta
<p>Kae Araki (<a href="/wiki/Japanin_kieli">jap.</a> 荒木香恵, oikealta nimeltään Kae Abe, s. <a href="/wiki/6._marraskuuta">6. marraskuuta</a> <a href="/wiki/1966">1966</a> <a href="/wiki/Osaka">Osaka</a>) on <a href="/wiki/Japani">japanilainen</a> <a href="/wiki/Seiy%C5%AB">ääninäyttelijä</a>, <a href="/wiki/Seiy%C5%AB">seiyū</a>, joka on näytellyt monissa <a href="/wiki/Anime">anime</a>- ja <a href="/wiki/Televisio">televisiosarjoissa</a>, muun muassa <a href="/wiki/Babar">Babar</a>, <a href="/wiki/Cardcaptor_Sakura">Cardcaptor Sakura</a>, <a href="/wiki/Digimon">Digimon</a>, <a href="/w/index.php?title=Fushigi_y%C5%ABgi&action=edit&redlink=1">Fushigi yūgi</a>, <a href="/wiki/Great_Teacher_Onizuka">Great Teacher Onizuka</a>, <a href="/wiki/Kodomo_no_omocha">Kodomo no omocha</a>, <a href="/w/index.php?title=Wakakusa_monogatari_%E2%80%93_Nan_to_Jo_no_sensei&action=edit&redlink=1">Wakakusa monogatari – Nan to Jo no sensei</a> ja <a href="/wiki/Pok%C3%A9mon">Pokémon</a>. Animesarjojen lisäksi hän on esiintynyt monissa peleissä. </p>
<h2>Aiheesta muualla</h2>
[<a href="/w/index.php?title=Kae_Araki&veaction=edit&section=1">muokkaa</a> | <a href="/w/index.php?title=Kae_Araki&action=edit&section=1">muokkaa wikitekstiä</a>]
<ul>
<li><a href="https://www.imdb.com/name/nm0032890/">Kae Araki</a> Internet Movie Databasessa. (englanniksi)</li>
</ul>
Tämä <a href="/wiki/N%C3%A4yttelij%C3%A4">näyttelijään</a> liittyvä artikkeli on <a href="/wiki/Wikipedia:Tynk%C3%A4">tynkä</a>. Voit auttaa Wikipediaa <a href="https://fi.wikipedia.org/w/index.php?title=Kae_Araki&veaction=edit">laajentamalla</a> artikkelia.<br>
3.3.4. Example of the site after preprocessing#
This short example illustrates how the HTML cleaning code works.
res = clean_html_min(test)
print(res)
Kae Araki (jap. 荒木香恵, oikealta nimeltään Kae Abe, s. 6. marraskuuta 1966 Osaka) on japanilainen ääninäyttelijä, seiyū, joka on näytellyt monissa anime- ja televisiosarjoissa, muun muassa Babar, Cardcaptor Sakura, Digimon, Fushigi yūgi, Great Teacher Onizuka, Kodomo no omocha, Wakakusa monogatari – Nan to Jo no sensei ja Pokémon. Animesarjojen lisäksi hän on esiintynyt monissa peleissä.
Tämä näyttelijään liittyvä artikkeli on tynkä. Voit auttaa Wikipediaa laajentamalla artikkelia.
# Apply the cleaning function with progress tracking:
from tqdm import tqdm
tqdm.pandas(desc="Cleaning HTML content")
combined_df['cleaned_html_content'] = combined_df['main_content'].progress_map(clean_html_min)
combined_df
Cleaning HTML content: 100%|██████████| 220210/220210 [00:53<00:00, 4142.45it/s]
| | title | url | main_content | cleaned_html_content |
|---|---|---|---|---|
| 0 | forum.bomber.fi - Omat asetukset - Käyttöehdot | https://www.bomber.fi/forums/user/terms?sid=cd... | <h2>forum.bomber.fi - Käyttöehdot</h2>\n\n<p>K... | None |
| 1 | Yhteystiedot - Mustasaaren seurakuntayhtymä | https://www.mustasaarenseurakuntayhtyma.fi/yht... | <h1>Yhteystiedot</h1>\n\n<p> </p>\n\n<h4>Musta... | None |
| 2 | VAELLUSNET - Vaellusturinat II - Omat asetukse... | http://www.vaellusnet.com/ucp.php?mode=terms&s... | <h2>VAELLUSNET - Vaellusturinat II - Käyttöehd... | None |
| 3 | Gives me some privacy | Dekottaa | http://www.dekottaa.com/2014/01/gives-me-some-... | <h2>26.1.2014</h2>\n\n<a href="">\n\n<h1> Give... | Liitutaulutarra kanan muodossa. Jos ei halua j... |
| 4 | Suomen Briard ry - Lähetä sähköpostia | http://www.suomenbriard.net/phpBB/memberlist.p... | <h2>Yhteystiedot käyttäjälle</h2>\n\nYlläpitäj... | Tämä viesti lähetetään pelkkänä tekstinä. Älä ... |
| ... | ... | ... | ... | ... |
| 1091706 | Vastauspalvelu | https://vastauspalvelu.omataloyhtio.fi/ | <a href="https://jurinet.fi/">Jurinet</a>\nKuv... | Taloyhtiömme on asennettu uusi juuri ilmanpois... |
| 1091814 | Sound Particles Studio-ohjelmistot - Pikalatau... | https://www.muziker.fi/sound-particles-studio-... | <p> Valitse maa, johon lähetys toimitetaan </p... | None |
| 1091907 | Lattialämmityskaapelit - Hammarin Sähkö Oy | https://www.hammarinsahko.fi/sahkotarvikkeet/l... | Luotettavaa kauppaa yli 110 vuotta\n\n<h2>Latt... | Lattialämmityskaapelit varaavaan lattialämmity... |
| 1091934 | Kotitalousvähennyslaskuri 2025: Laske kotitalo... | https://vertaakorkoja.fi/kotitalousvahennyslas... | <h1>Kotitalousvähennyslaskuri</h1>\n\n<p>Kotit... | Kotitalousvähennyslaskurin avulla voit laskea ... |
| 1092309 | Ota meihin yhteyttä – Mothersusurrus.com | https://mothersusurrus.com/ota-meihin-yhteytta/ | <h1>Ota meihin yhteyttä</h1>\n\n<h4>Mikäli sin... | Mikäli sinulla on kysyttävää musiikista, tai h... |
220210 rows × 4 columns
3.3.5. Drop duplicates and None values#
print(f"Df shape before: {combined_df.shape}")
combined_df = combined_df.drop_duplicates(subset='cleaned_html_content')
combined_df = combined_df.dropna(subset=['cleaned_html_content'])
print(f"Df shape after: {combined_df.shape}")
Df shape before: (220210, 4)
Df shape after: (139074, 4)
3.3.6. Filter by word count#
Next, we calculate the word count for each content entry and filter out entries with 30 words or fewer.
# Calculate word count for each entry
combined_df['word_count'] = combined_df['cleaned_html_content'].str.split().str.len()
# Sort by word count and reset the index
combined_df = combined_df.sort_values(by='word_count').reset_index(drop=True)
combined_df.head(3)
| | title | url | main_content | cleaned_html_content | word_count |
|---|---|---|---|---|---|
| 0 | Kyky – Welcome | https://kyky.today/ | Kyky Kyky\n • Ota yhteyttä\n • Rekisteröidy\... | Ostaja maksaa sinulle suoraan! | 4 |
| 1 | Gluteeniton ruoka - Upbeat Intl. Trading Oy | https://www.east-asia-mart.fi/fi/tuoteryhma/23... | |\n • e-Lahjakortit ja Onnenkassit (Fukubukur... | 300 g Laatikko, Singapore. | 4 |
| 2 | Tietoja sivusta ”C. S. Lewis” – ApoWiki | https://apowiki.fi/index.php?action=info&title... | Anonyymi\n\nEt ole kirjautunut\n\n • Keskuste... | Katso tämän sivun suojausloki. | 4 |
combined_df.tail(3)
| | title | url | main_content | cleaned_html_content | word_count |
|---|---|---|---|---|---|
| 139071 | Ortodoksinen oppi pelastuksesta – Tsasounan su... | https://www.tsasouna.net/FI/2024/09/07/ortodok... | Skip to content\nTsasounan suunnalta\n\n • Or... | Q & A – kysy papilta!\nQ & A – Mikä ja miksi?\... | 69952 |
| 139072 | Vuosikirja 2021 - Cockerspanielit ry | https://cockerspanielit.org/vuosikirja-2022-2/ | <h1>Vuosikirja 2021</h1>\n\n<p>Koostanut Pirjo... | Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ... | 73949 |
| 139073 | vierailija, tekijä sivustolla Hiiltä ja timanttia | https://blogit.metropolia.fi/hiilta-ja-timantt... | Hyppää sisältöön\nMetropolian Blogit\n • Uusi... | Verkko-opetus on tullut jäädäkseen, mutta mite... | 85067 |
combined_df['word_count'].describe()
count 139074.000000
mean 339.775587
std 962.370831
min 4.000000
25% 52.000000
50% 148.000000
75% 345.000000
max 85067.000000
Name: word_count, dtype: float64
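The describe() output shows a long-tailed distribution (median 148 words, max 85,067). If a fixed 30-word cutoff feels arbitrary, a quantile-based threshold is one alternative; a small sketch with pandas (toy numbers, not the real distribution):

```python
import pandas as pd

# Hypothetical word counts illustrating a long-tailed distribution
word_counts = pd.Series([4, 10, 35, 52, 148, 345, 900, 85067])

# Drop the shortest quarter of entries instead of using a fixed cutoff
threshold = word_counts.quantile(0.25)
kept = word_counts[word_counts > threshold]
print(threshold, len(kept))  # → 28.75 6
```

The tutorial itself sticks with a fixed cutoff of 30 words, which is simpler to reason about and reproduce.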
print(f"Df shape before: {combined_df.shape}")
combined_df = combined_df[combined_df['word_count'] > 30]
print(f"Df shape after: {combined_df.shape}")
Df shape before: (139074, 5)
Df shape after: (117133, 5)
3.3.7. Detect language#
Let’s use the langdetect library to double-check the language of each entry and keep only those written in Finnish. Detection can take a few minutes.
from langdetect import detect, LangDetectException
def detect_language_or_none(text):
try:
return detect(text)
except LangDetectException:
return None
combined_df['language_detected'] = combined_df['cleaned_html_content'].map(detect_language_or_none)
combined_df
| | title | url | main_content | cleaned_html_content | word_count | language_detected |
|---|---|---|---|---|---|---|
| 21941 | 4.12.2024 -Työturvallisuuskoulutus - CadSa | https://cadsa.fi/koulutuskalenteri/tyoturvalli... | <h1>4.12.2024 -Työturvallisuuskoulutus</h1>\n\... | Työturvallisuuskoulutus on työturvakeskuksen k... | 31 | fi |
| 21942 | Huulipuna unohtu | https://huulipunaunohtu.blogspot.com/ | Siirry pääsisältöön\n\nHuulipuna unohtu\n\nÄit... | Äiti on pitänyt meistä huolta, nyt me pidämme ... | 31 | fi |
| 21943 | maa-artisokkapikkelsi | Olemme puutarhassa | http://olemmepuutarhassa.fi/tag/maa-artisokkap... | maa-artisokkapikkelsi | Olemme puutarhassa\n\n... | Heti kun maa on sulanut voi esiin kaivaa viime... | 31 | fi |
| 21944 | REIDEN LOITONTAJALAITE | Ironfit Store | https://store.ironfit.fi/product/265/ironfit-r... | <p>IRONFIT REIDEN LOITONTAJALAITE ST-6007</p>\... | Tilattavissa. Toimitusaika 21 päivää.\nTilatta... | 31 | fi |
| 21945 | Työpenkki Henning, levyn leveys 1500 mm, hylly... | https://www.gerdmans.fi/varasto-ja-teollisuus/... | <h1> Työpenkki Henning, levyn leveys 1500 mm, ... | Työpenkki Henning, levyn leveys 1500 mm, hylly... | 31 | fi |
| ... | ... | ... | ... | ... | ... | ... |
| 139069 | Sanatarkat istuntoselostukset - Keskiviikko 20... | https://www.europarl.europa.eu/doceo/document/... | \nTakaisin Europarl-portaaliin\n\nChoisissez ... | Der Präsident. – Bevor wir zum Tätigkeitsprogr... | 63237 | de |
| 139070 | SKVR | https://aineistot.finlit.fi/exist/apps/skvr/ru... | Esittely Runoluettelo / Metatietosuodatus Runo... | Tällä sivulla voit selata runotyyppejä ja luke... | 69370 | fi |
| 139071 | Ortodoksinen oppi pelastuksesta – Tsasounan su... | https://www.tsasouna.net/FI/2024/09/07/ortodok... | Skip to content\nTsasounan suunnalta\n\n • Or... | Q & A – kysy papilta!\nQ & A – Mikä ja miksi?\... | 69952 | fi |
| 139072 | Vuosikirja 2021 - Cockerspanielit ry | https://cockerspanielit.org/vuosikirja-2022-2/ | <h1>Vuosikirja 2021</h1>\n\n<p>Koostanut Pirjo... | Näyttelyt: Alavus KR 13.6. Jouko Leiviskä AVO ... | 73949 | fi |
| 139073 | vierailija, tekijä sivustolla Hiiltä ja timanttia | https://blogit.metropolia.fi/hiilta-ja-timantt... | Hyppää sisältöön\nMetropolian Blogit\n • Uusi... | Verkko-opetus on tullut jäädäkseen, mutta mite... | 85067 | fi |
117133 rows × 6 columns
combined_df['language_detected'].value_counts()
language_detected
fi 103429
en 9031
sv 999
id 497
de 358
it 353
pl 284
hr 255
et 243
nl 233
fr 200
lt 197
es 140
sl 119
ca 100
da 87
tr 71
cs 70
pt 69
no 56
ro 55
lv 51
ru 46
sk 41
hu 26
mk 24
vi 18
tl 14
sq 14
ko 8
sw 8
ar 6
uk 4
hi 4
bn 4
el 4
te 2
bg 2
cy 2
fa 2
af 2
he 2
ne 2
so 1
Name: count, dtype: int64
# Retain only detected Finnish data
print(f"Df shape before: {combined_df.shape}")
combined_df = combined_df[combined_df['language_detected'] == 'fi']
print(f"Df shape after: {combined_df.shape}")
Df shape before: (117133, 6)
Df shape after: (103429, 6)
3.4. Save the data to a parquet file#
Now that the data is cleaned, we’re ready to save it. We’ll select only the necessary columns before saving.
# drop the unnecessary columns
combined_df = combined_df[['title', 'cleaned_html_content']]
combined_df.head(2)
| | title | cleaned_html_content |
|---|---|---|
| 21941 | 4.12.2024 -Työturvallisuuskoulutus - CadSa | Työturvallisuuskoulutus on työturvakeskuksen k... |
| 21942 | Huulipuna unohtu | Äiti on pitänyt meistä huolta, nyt me pidämme ... |
# Save the new dataframe with the detected Finnish language
path_to_save_the_data = '<path-here-ending-to-parquet-file-name>' # /scratch is recommended for data files
combined_df.to_parquet(path_to_save_the_data, index=False)
After saving data to a parquet file, you may choose to exit the Jupyter environment or continue working within it to create the upcoming Python and batch job scripts while the session remains active.
4. Finetune the model#
In this step, we’ll train the model using a batch job and a Python script. You can create and edit these files either via the LUMI web interface or by using Visual Studio Code’s Remote SSH extension (for more details, see the documentation here).
The training scripts used in this tutorial are based on the CSCfi/llm-fine-tuning-examples repository.
We will also use MLflow to track training metrics. For a practical example, see the tutorial on using MLflow in Puhti and LUMI.
You can create the necessary files under your project directory, e.g.:
/project/project_46XXXXXXXX/${USER}
Using Llama models through transformers
If you want to use Llama models via the transformers library, follow these steps:
Create a Hugging Face account
Locate the Llama models, read and accept their terms of use, and wait for approval
Generate an access token in your Hugging Face account settings
Set the access token in your HF cache directory (HF_HOME), for example:
export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache
mkdir -p $HF_HOME
cd <path-to-HF_HOME>
echo "<token-here>" > token
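As a quick sanity check that the token file landed where the Hugging Face libraries will look for it ($HF_HOME/token is the classic location huggingface_hub reads), a stdlib-only sketch:

```python
import os
from pathlib import Path

# Default HF cache is ~/.cache/huggingface unless HF_HOME is set (as above)
hf_home = Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))
token_file = hf_home / "token"
print("token present:", token_file.is_file())
```

If this prints False inside your batch job, double-check that HF_HOME is exported in the job script, since the environment of an interactive shell is not inherited automatically.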
4.0. Container for training#
First we need to create a compatible environment for training.
This example shows how to use cotainr to build a container with PyTorch configured for LUMI’s AMD GPUs. We’ll follow this approach to create our training container.
Create training_env.yml file:
name: training_env
channels:
- conda-forge
dependencies:
- filelock=3.13.1
- fsspec=2024.2.0
- jinja2=3.1.3
- markupsafe=2.1.5
- mpmath=1.3.0
- networkx=3.2.1
- numpy=1.26.3
- pillow=10.2.0
- pip=24.0
- python=3.11.7
- sympy=1.12
- typing-extensions=4.9.0
- pip:
- --extra-index-url https://download.pytorch.org/whl/
- pytorch-triton-rocm==2.3.1
- torch==2.3.1+rocm6.0
- torchaudio==2.3.1+rocm6.0
- torchvision==0.18.1+rocm6.0
- langchain==0.3.27
- mlflow==2.22.0
- datasets==4.0.0
- peft==0.17.0
- transformers==4.55.0
In the LUMI terminal (note: building the container takes several minutes):
# Get needed modules
module purge
module load LUMI
module load cotainr
# Use cotainr to build the container
cotainr build training_env.sif --system=lumi-g --conda-env=training_env.yml --accept-license
# Add required additional bindings
module use /appl/local/containers/ai-modules/
module load singularity-AI-bindings
# Verify installation
singularity exec training_env.sif bash -c 'pip list'
4.1. Python scripts for training the model#
Below are the Python scripts used to finetune the Meta Llama-3.2-1B model. They include:
Training data preprocessing using a custom preprocess function that chunks and tokenizes the input text - implemented in preprocessing.py
Training setup using Hugging Face's Trainer class - implemented in train.py
Metric tracking with MLflow - see train.py
Model saving and checkpointing - see train.py
4.1.1. Python script: preprocessing.py#
This script handles text splitting using LangChain’s RecursiveCharacterTextSplitter. It breaks long text inputs into smaller overlapping chunks, optionally appending an end-of-sequence token to each chunk. The script also includes a preprocessing function that tokenizes these chunks with a Hugging Face tokenizer.
from langchain.text_splitter import RecursiveCharacterTextSplitter
def chunk_text(text, chunk_size, overlap_size, eos_token):
"""Splits a single large text into smaller overlapping chunks."""
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=overlap_size,
)
chunks = splitter.split_text(text)
if eos_token:
chunks = [chunk + f" {eos_token}" for chunk in chunks]
return chunks
def preprocess(examples, tokenizer, max_tokens=4096, chunk_size=8192, overlap_size=200):
"""Preprocesses a batch of examples by splitting text content into chunks and tokenizing them."""
all_chunks = []
for text in examples["cleaned_html_content"]:
chunks = chunk_text(text, chunk_size=chunk_size, overlap_size=overlap_size, eos_token=tokenizer.eos_token)
all_chunks.extend(chunks)
tokenized_output = tokenizer(
all_chunks,
padding=False,
truncation=True,
max_length=max_tokens,
add_special_tokens=True,
return_length=False,
)
return {
"input_ids": tokenized_output["input_ids"],
"attention_mask": tokenized_output["attention_mask"]
}
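To see what overlap_size buys you, here is a stdlib-only approximation of overlapping chunking (fixed character windows; RecursiveCharacterTextSplitter additionally prefers to split at separators, so real chunk boundaries will differ):

```python
def fixed_window_chunks(text, chunk_size, overlap_size):
    """Approximate overlapping chunking with fixed-size character windows."""
    step = chunk_size - overlap_size
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap_size, 1), step)]

chunks = fixed_window_chunks("abcdefghij", chunk_size=4, overlap_size=2)
print(chunks)  # → ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last overlap_size characters of its predecessor, so sentences cut at a chunk boundary still appear intact in the next chunk, at the cost of some duplicated training tokens.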
4.1.2. Python script: train.py#
This is the main Python script used to finetune the meta-llama/Llama-3.2-1B model.
It uses MLflow to track training metrics, which are saved in the mlruns folder inside the specified --output-path.
Remember: Make sure to provide the correct path to your training data in Parquet format via the --parquet-file argument, either here or in your batch job script.
Note! By default, the script runs a small test training using only 1,000 samples. To train on the full dataset, comment out these lines and adjust the training parameters accordingly:
# comment these if you would like to use the whole dataset
tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))
tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))
train.py
import argparse
import os
import sys
import time
import mlflow
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from functools import partial
from preprocessing import preprocess
from datasets.utils.logging import disable_progress_bar
if __name__ == "__main__":
#disable_progress_bar() # Disable progress bar during dataset processing
parser = argparse.ArgumentParser() # set up ArgumentParser
parser.add_argument(
"--input-model",
type=str,
default="meta-llama/Llama-3.2-1B",
help="The pre-trained model from Hugging Face to use as basis: https://huggingface.co/models",
)
parser.add_argument(
"--output-path",
type=str,
help="Directory where model checkpoints and outputs will be saved.",
)
parser.add_argument(
"--parquet-file",
type=str,
#default='<path-to-training-data-in-one-parquet-file>',
help="Path to the input Parquet file containing training data.",
)
parser.add_argument(
"--model_output_name",
type=str,
help="Name for the finetuned model to be saved under.",
)
parser.add_argument("--batch_size", "-b", type=int, default=1, help="Training batch size")
parser.add_argument(
"--num-workers",
type=int,
default=1,
help="The number of CPU worker processes to use.",
)
parser.add_argument(
"--resume",
default=False,
action="store_true",
help="If set, continue from a previously interrupted run. Otherwise, overwrite existing checkpoints.",
)
parser.add_argument(
"--max-steps",
type=int,
default=400,
help="The number of training steps.",
)
parser.add_argument("--peft", action="store_true", help="Use PEFT: https://huggingface.co/docs/peft/index")
parser.add_argument(
"--4bit",
dest="bnb_4bit",
action="store_true",
help="Use 4bit quantization with bitsandbytes: https://huggingface.co/docs/bitsandbytes/main/en/index",
)
args, _ = parser.parse_known_args()
# Check for required arguments
if not args.model_output_name:
    print("ERROR: --model_output_name must be specified.")
    sys.exit(1)

# Read the environment variables provided by torchrun
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])

# Initialize MLflow only on the main process (rank 0) to prevent multi-process conflicts
if rank == 0:
    # Set the MLflow tracking URI to save logs and artifacts under the specified output directory
    mlflow_tracking_uri = os.path.join(args.output_path, "mlruns")
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    # Use the model output name as the MLflow experiment name
    mlflow.set_experiment(args.model_output_name)
    print(f"MLflow tracking URI: {mlflow_tracking_uri}")

# This is where the trained model and checkpoints will go
output_model_dir = os.path.join(args.output_path, args.model_output_name)

# Then we determine the device on which to train the model.
if rank == 0:
    print("Using PyTorch version:", torch.__version__)
    print(f"Using {world_size} GPUs ({local_world_size} on this node).")
    print(f"Number of available GPUs (visible to this process): {torch.cuda.device_count()}")
print(f"Rank: {rank}")
if torch.cuda.is_available():
    device = torch.device("cuda", local_rank)
    print(f"Using GPU {local_rank}, device name: {torch.cuda.get_device_name(device)}")
else:
    print(f"No GPU found, using CPU instead. (Rank: {local_rank})")
    device = torch.device("cpu")

# The global batch size must split evenly across all GPUs; every rank must exit on failure.
if args.batch_size % world_size != 0:
    if rank == 0:
        print(f"ERROR: batch_size={args.batch_size} has to be a multiple of the number of GPUs={world_size}!")
    sys.exit(1)

if rank == 0:
    print(f"output_model_dir: {output_model_dir}")
start = time.time()
tokenizer = AutoTokenizer.from_pretrained(args.input_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
special_tokens = tokenizer.special_tokens_map
if rank == 0:
    print("Loading input model and tokenizer")

quantization_config = None
if args.bnb_4bit:
    from transformers import BitsAndBytesConfig

    print("Using bnb_4bit")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=torch.bfloat16,
    )
    quantization_config = bnb_config

model = AutoModelForCausalLM.from_pretrained(
    args.input_model,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map=device,
)
if args.peft:
    # peft_config = LoraConfig(
    #     task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32,
    #     lora_dropout=0.1
    # )
    # LoRA config from here:
    # https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/scripts/run_fsdp_qlora.py#L128
    peft_config = LoraConfig(
        lora_alpha=8,
        lora_dropout=0.05,
        r=16,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
        # modules_to_save=["lm_head", "embed_tokens"]  # add if you want to use the Llama 3 instruct template
    )
    model = get_peft_model(model, peft_config)
    print("Using PEFT")
    model.print_trainable_parameters()

stop = time.time()
if rank == 0:
    print(f"Loading model and tokenizer took: {stop - start:.2f} seconds")

train_batch_size = args.batch_size
eval_batch_size = args.batch_size
if rank == 0:
    print(f"Global train and eval batch size: {args.batch_size}")
training_args = TrainingArguments(
    disable_tqdm=True,
    output_dir=output_model_dir,
    save_strategy="steps",
    save_steps=50,  # MODIFY when moving from quick testing to real training, e.g. 50 -> 400!
    save_total_limit=3,
    learning_rate=2e-5,  # 3e-5,
    weight_decay=0.01,
    bf16=True,  # use 16-bit floating point precision
    per_device_train_batch_size=train_batch_size // world_size,
    per_device_eval_batch_size=eval_batch_size,
    dataloader_num_workers=args.num_workers,
    ddp_find_unused_parameters=False,
    dataloader_pin_memory=True,
    metric_for_best_model="eval_loss",
    eval_strategy="steps",
    eval_steps=100,  # MODIFY when moving from quick testing to real training, e.g. 100 -> 200!
    num_train_epochs=2,
    max_steps=args.max_steps,  # COMMENT THIS OUT if using a bigger dataset
    # MLflow integration
    report_to=["mlflow"],
    logging_steps=50,  # MODIFY for real training runs!
    logging_strategy="steps",
    # Run name for MLflow; includes the SLURM job ID to identify the run
    run_name=f"{args.model_output_name}_{os.environ.get('SLURM_JOB_ID')}",
)
# if rank == 0:
#     print(f"Training arguments: {training_args}")
# Load parquet data
raw_dataset = load_dataset("parquet", data_files=args.parquet_file)
# Split dataset into train and validation sets
split_dataset = raw_dataset["train"].train_test_split(test_size=0.1, seed=42)

max_tokens = 2048
overlap_tokens = 50
if rank == 0:
    print("Dataset columns:", raw_dataset["train"].column_names)
    print(f"Type of column_names: {type(raw_dataset['train'].column_names)}")
column_names = raw_dataset["train"].column_names

preprocess_function = partial(
    preprocess, tokenizer=tokenizer, max_tokens=max_tokens, chunk_size=8192, overlap_size=overlap_tokens
)
tokenized_train_dataset = split_dataset["train"].map(
    preprocess_function,
    batched=True,
    remove_columns=column_names,
    num_proc=args.num_workers,
)
tokenized_val_dataset = split_dataset["test"].map(
    preprocess_function,
    batched=True,
    remove_columns=column_names,
    num_proc=args.num_workers,
)

####################################################
# Comment these out if you would like to use the whole dataset
tokenized_train_dataset = tokenized_train_dataset.shuffle(seed=42).select(range(900))
tokenized_val_dataset = tokenized_val_dataset.shuffle(seed=42).select(range(100))

# Print the sizes to verify
if rank == 0:
    print(f"Train dataset size: {len(tokenized_train_dataset)}")
    print(f"Validation dataset size: {len(tokenized_val_dataset)}")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

start_train = time.time()
if rank == 0:
    print("Training starting...")
# Train the model - MLflow will automatically log metrics
trainer.train(resume_from_checkpoint=args.resume)
stop_train = time.time()
if rank == 0:
    elapsed = stop_train - start_train
    hours = int(elapsed // 3600)
    minutes = int((elapsed % 3600) // 60)
    seconds = int(elapsed % 60)
    print(f"Finetuning model took: {hours}h {minutes}m {seconds}s")

# Save the model
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
trainer.save_model(output_model_dir)

if rank == 0:
    print()
    print("Training done, you can find the final model (and checkpoints) in", output_model_dir)
    print(f"\nMLflow experiment data stored in: {mlflow_tracking_uri}")
4.2. Batch job script for training with 8 GPUs#
To run training on LUMI using 8 GPUs, you need to submit a batch job via a SLURM script. Below is an example script named run_train_8gpu.sh.
This script:
Requests resources from the GPU partition (e.g. dev-g / small-g), including 8 GPUs, 56 CPU cores, and 480 GB of memory.
Loads the necessary modules for Singularity container support.
Sets environment variables for the Hugging Face cache and tokenizer behavior.
Defines an output directory for saving the trained model and logs.
Launches the training inside the container using torchrun with distributed training support.
Remember to replace <number-here> with your project ID, <path-to-training-data-parquet-file> with the actual path to your preprocessed training data, and <path-to-training-container> with the path to your training container (e.g., training_env.sif).
Also, consider switching the partition to small-g and adjusting the --time parameter for longer training runs.
run_train_8gpu.sh
#!/bin/bash
#SBATCH --account=project_<number-here>
#SBATCH --partition=dev-g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=56
#SBATCH --mem=480G
#SBATCH --time=00:15:00
#SBATCH --gpus-per-node=8
module use /appl/local/containers/ai-modules
module load singularity-AI-bindings
# This will store all the Hugging Face cache such as downloaded models
# and datasets in the project's scratch folder
export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache
mkdir -p $HF_HOME
# Path to where the trained model and logging data will go
OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data
mkdir -p $OUTPUT_DIR
TRAINING_DATA_FILE=<path-to-training-data-parquet-file>
# Disable internal parallelism of huggingface's tokenizer since we
# want to retain direct control of parallelism options.
export TOKENIZERS_PARALLELISM=false
set -xv # print the command so that we can verify setting arguments correctly from the logs
CONTAINER=<path-to-training-container>
srun singularity exec $CONTAINER \
    torchrun --standalone \
    --nnodes=1 \
    --nproc-per-node=$SLURM_GPUS_PER_NODE \
    train.py "$@" \
    --output-path $OUTPUT_DIR \
    --parquet-file $TRAINING_DATA_FILE \
    --model_output_name="Llama-3.2-1B-finetuned" \
    --num-workers $SLURM_CPUS_PER_TASK \
    --batch_size=8
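Before submitting, it can save queue time to confirm that the placeholder paths actually exist. The following pre-flight helper is hypothetical and not part of the tutorial's scripts; substitute your real container and data paths before using it:

```shell
#!/bin/bash
# check_path is a hypothetical helper: it reports whether a path exists,
# so a typo in a placeholder fails fast instead of after the job has queued.
check_path() {
    if [ -e "$1" ]; then
        echo "OK: $1"
    else
        echo "MISSING: $1"
        return 1
    fi
}

# Demonstration: one path that exists and one that (most likely) does not.
check_path "$HOME"
check_path "/no/such/training_env.sif" || echo "Fix this path before running sbatch."
```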
4.3. Run training script#
To train the model on LUMI with 8 GPUs, submit the batch job using the SLURM script provided in run_train_8gpu.sh.
Simply run the following command in the LUMI terminal:
sbatch run_train_8gpu.sh
Once the job starts, a SLURM output file named slurm-{slurm_job_id}.out will be created automatically in the directory you submitted from.
You can monitor the status of your jobs at any time with squeue --me or sacct. Any extra options appended to the sbatch command (for example --peft) are forwarded to train.py by the batch script.
4.4. Use MLflow to check metrics#
After the training completes, you’ll find the logged data inside the mlruns folder located within your specified output directory. If you did not change this path in run_train_8gpu.sh, the MLflow metrics are located at /scratch/${SLURM_JOB_ACCOUNT}/${USER}/training_output_data/mlruns.
To visualize and monitor your training metrics, you can open an MLflow session via the LUMI web interface. Navigate to Apps -> MLflow.
Set the Location where MLflow files are stored to the full path where your mlruns folder is located. After launching the session, you can interactively browse training metrics, losses, and parameters.
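If you only need a quick look at a final number over SSH, you can also read the file store directly with the Python standard library. This sketch assumes MLflow's default file-store layout, in which each metric is a plain-text file of "timestamp value step" lines under mlruns/<experiment_id>/<run_id>/metrics/; the function name final_metric is ours, not an MLflow API:

```python
import os


def final_metric(mlruns_dir: str, metric: str = "eval_loss"):
    """Scan an MLflow file store for the last logged value of a metric.

    Returns {run_id: (step, value)} for every run that logged the metric,
    reading the "timestamp value step" lines MLflow's file store writes.
    """
    results = {}
    for root, _dirs, files in os.walk(mlruns_dir):
        if os.path.basename(root) == "metrics" and metric in files:
            run_id = os.path.basename(os.path.dirname(root))
            with open(os.path.join(root, metric)) as f:
                lines = [ln.split() for ln in f if ln.strip()]
            if lines:
                _timestamp, value, step = lines[-1]
                results[run_id] = (int(step), float(value))
    return results
```

The web UI remains the better tool for comparing runs; this is only a convenience for headless sessions.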
5. Test the model#
After finetuning, you can test the model using a Python script and a SLURM batch job. Inference results will be saved to a logging file for review.
To run inference, simply submit the batch job with: sbatch run_inference.sh
This will generate model outputs for your predefined prompts and log them for inspection.
Note: we don’t need the training container here, since the pytorch/2.5 module loaded in the batch script provides all the required packages.
run_inference.sh
#!/bin/bash
#SBATCH --account=project_XXXXXXXXXX
#SBATCH --partition=dev-g
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=7
#SBATCH --mem=60G
#SBATCH --time=0:15:00
#SBATCH --gpus-per-node=1
module purge
module use /appl/local/csc/modulefiles/
module load pytorch/2.5
# This will store all the Hugging Face cache such as downloaded models
# and datasets in the project's scratch folder
export HF_HOME=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-cache
mkdir -p $HF_HOME
export LOG_FILE_PATH=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/inference_logs
mkdir -p $LOG_FILE_PATH
export LOG_FILE=${LOG_FILE_PATH}/inference_prints.log
# Path to where the trained model and logging data will go
OUTPUT_DIR=/scratch/${SLURM_JOB_ACCOUNT}/${USER}/hf-data
mkdir -p $OUTPUT_DIR
# Disable internal parallelism of huggingface's tokenizer since we
# want to retain direct control of parallelism options.
export TOKENIZERS_PARALLELISM=false
set -xv # print the command so that we can verify setting arguments correctly from the logs
MODEL_PATH_1="meta-llama/Llama-3.2-1B"
MODEL_PATH_2="</path/to/your/finetuned/model>"
# Define prompts as an array
PROMPTS=(
    "Tekoälyn kehitys muuttaa maailmaa nopeasti ja siksi "  # "The development of AI is changing the world rapidly, and therefore "
    "Tervetuloa "                                           # "Welcome "
)
# Run inference for each model and prompt combination
for MODEL in "$MODEL_PATH_1" "$MODEL_PATH_2"; do
    for PROMPT in "${PROMPTS[@]}"; do
        srun python inference.py \
            --model "$MODEL" \
            --prompt "$PROMPT"
    done
done
inference.py
import logging
import argparse
import torch
import os
from datetime import datetime
from transformers import AutoModelForCausalLM, AutoTokenizer
LOG_FILE = os.environ.get('LOG_FILE')
slurmjob_id = os.environ['SLURM_JOBID']
# Logging file settings
logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model",
        type=str,
        help="Path to fine-tuned model directory",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        help="Prompt for the LLM to continue",
    )
    args = parser.parse_args()

    logging.info(f"Slurmjob_ID : {slurmjob_id}")
    logging.info(f"Model Path: {args.model}")
    logging.info(f"Prompt: {args.prompt}")

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    print(f"Using device {device}")
    if device.type == "cuda":
        print(f"Device name is {torch.cuda.get_device_name(device)}")

    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(args.model)
    model.to(device)

    with torch.no_grad():
        inputs = tokenizer(args.prompt, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, do_sample=True, max_length=200, num_return_sequences=2)

    decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print("Generated Outputs:")
    logging.info("Generated Outputs:")
    for i, text in enumerate(decoded_outputs):
        print(f"\n--- Output {i + 1} ---\n{text}")
        logging.info(f"\n--- Output {i + 1} ---\n{text}")
    logging.info("-" * 40)
Thank you for following the tutorial — we hope you found it useful!
For more information on the OpenWebSearch.eu project see: https://openwebsearch.eu/
For more information on the LUMI supercomputer and CSC, see: https://www.lumi-supercomputer.eu/, https://www.csc.fi/