# Tutorial 9: How to Develop New Modules for Resilipipe

Each record in a WARC file is processed sequentially in a modular fashion, as illustrated below.

*(Figure: modular processing of WARC records)*

A central advantage of designing the pipeline in a modular fashion is that everyone can contribute new modules. If you want to develop your own module, simply implement one of the abstract classes defined in abstract.py.

## Module Classes

The classes differ in their position in the pipeline, their input parameters, and the types of records they are applied to. The table below gives an overview of the classes:

| Module Class | Input parameters | Position in the pipeline |
| --- | --- | --- |
| `BasicModule` | `record`: instance of `fastwarc.WarcRecord`; `record_dict`: dictionary with data from the HTTP and WARC headers | At the beginning of the pipeline, after parsing the HTTP and WARC headers. Applied to all records. |
| `HTMLModule` | `tree`: `HTMLTree`; `plain_text`: `str`; `language`: `str`; `json_ld`: `Sequence`; `microdata`: `Sequence` | After parsing the HTML and microdata/JSON-LD. Applied only to records with MIME type `text/html`. |
| `PostProcessingModule` | `record`: instance of `fastwarc.WarcRecord`; `record_dict`: dictionary with data from all previous processing steps | At the end of the pipeline; order depends on the order in `modules.yaml`. Applied to all records. |
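As a minimal sketch of the signature differences in the table, consider a `BasicModule`-style module that only reads the already-parsed headers. The class name `ContentLengthModule` and the `content_length` header key are assumptions made for this example; a real module would subclass `BasicModule` from `abstract.py`.

```python
# Illustrative sketch only: a real module would subclass BasicModule
# from abstract.py. The class name and the "content_length" key are
# assumptions made for this example.
class ContentLengthModule:
    @staticmethod
    def process_record(record, record_dict):
        # record is a fastwarc.WarcRecord; this sketch only reads the
        # already-parsed headers collected in record_dict.
        # dict.get returns None for a missing key, matching the
        # {COLUMN_NAME: None} convention for empty output.
        return {"content_length": record_dict.get("content_length")}
```

An `HTMLModule` would instead receive the parsed `tree`, `plain_text`, `language`, `json_ld`, and `microdata` arguments, and a `PostProcessingModule` receives the accumulated `record_dict` from all previous steps.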

## Class Methods

Each module class needs to implement the following static methods (`@staticmethod`):

| Method | Description |
| --- | --- |
| `prepare` | Prepare the module when the pipeline is first deployed. This is called every time Resilipipe is deployed at a new data center. Example: download resources that stay the same between runs of the pipeline. |
| `load_input` | Load data or instantiate classes that are the same for all records to be processed. Return them as a dictionary `{VARIABLE_NAME: DATA}` to be used by the `process_record` function. If your module does not need prepared data, return an empty dictionary. |
| `process_record` | Take the inputs defined by the module class. Return the result(s) of your module as a dictionary `{COLUMN_NAME: OUTPUT}`. Empty output should be returned as `{COLUMN_NAME: None}`. |
| `get_spark_schema` | Return a list of `StructField` elements with the field name and PySpark data type for each return column of the module. |
| `get_pyarrow_schema` | Return a list of `pa.field` elements with the column name and PyArrow data type for each return column of the module. |
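The five methods above can be sketched together for an `HTMLModule`-style module that counts words in the extracted plain text. The class name and the `word_count` column are assumptions for illustration; a real module would subclass `HTMLModule` from `abstract.py`, and the schema methods assume PySpark and PyArrow are available at run time.

```python
# Illustrative sketch of the five static methods for an HTMLModule-style
# module that counts words in the extracted plain text. The class name
# and the "word_count" column are assumptions; a real module would
# subclass HTMLModule from abstract.py.
class WordCountModule:
    @staticmethod
    def prepare():
        # One-time setup when the pipeline is first deployed, e.g.
        # downloading resources that stay the same between runs.
        # Nothing is needed for this example.
        pass

    @staticmethod
    def load_input():
        # No data is shared across records here, so return an empty dict.
        return {}

    @staticmethod
    def process_record(tree, plain_text, language, json_ld, microdata):
        # Return {COLUMN_NAME: OUTPUT}; empty output is {COLUMN_NAME: None}.
        if not plain_text:
            return {"word_count": None}
        return {"word_count": len(plain_text.split())}

    @staticmethod
    def get_spark_schema():
        # Imported lazily so the sketch can be read without PySpark installed.
        from pyspark.sql.types import IntegerType, StructField
        return [StructField("word_count", IntegerType())]

    @staticmethod
    def get_pyarrow_schema():
        # Imported lazily so the sketch can be read without PyArrow installed.
        import pyarrow as pa
        return [pa.field("word_count", pa.int32())]
```

Note that `process_record` returns one dictionary per record, and the two schema methods must declare exactly the columns that `process_record` emits.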

## Add your module

After implementing your module using one of the abstract classes in abstract.py, you can add it to the pipeline by

  1. Adding the `MODULE.py` to the `modules` directory

  2. Putting the name of the module file (without `.py`) into `modules.yaml`

These steps are illustrated for `links.py` and `collection_indices`.
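Step 2 might look like the following; this is a hypothetical excerpt, assuming `modules.yaml` lists one module file name (without `.py`) per entry, so check the actual file in the repository for the exact layout. The entry `my_module` stands for your new file `my_module.py` in the `modules` directory.

```yaml
# Hypothetical modules.yaml excerpt; "my_module" is a placeholder for
# your new module file my_module.py in the modules directory.
modules:
  - links
  - my_module
```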