Tutorial 9: How to Develop New Modules for Resilipipe#
Each record in a WARC file is processed sequentially in a modular fashion as illustrated below.
A central advantage of designing the pipeline in a modular fashion is that everyone can contribute new modules. If you want to develop your own module, simply implement one of the abstract classes defined in abstract.py.
Module Classes#
The classes differ in their position in the pipeline, their input parameters, and the types of records they are applied to. The table below gives an overview of the classes:
Module Class | Input parameters | Position in the pipeline |
---|---|---|
 | - | At the beginning of the pipeline, after parsing HTTP and WARC headers; Applied to all records |
 | | After parsing the HTML and microdata/JSON-LD; Applied only to records with MIME-type |
 | - | At the end of the pipeline; Order depends on the order in the |
Class Methods#
Each Module-class needs to implement the following static methods (@staticmethod):
Method | Description |
---|---|
 | Prepare the module when first deploying the pipeline. This will be called every time Resilipipe is deployed at a new data center. |
 | Load data or instantiate classes that are the same for all records to be processed. Return them as a dictionary with |
 | Take inputs defined by the module class. Return the result(s) of your module as a dictionary of |
 | Return a list of |
 | Return a list of |
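The general shape of such a module can be sketched as follows. This is a minimal illustration only: the class name `HypotheticalModule` and the method names `prepare`, `load_resources`, and `process` are assumptions made for this example and are not taken from abstract.py, so check that file for the actual names and signatures. The sketch covers only the first three methods from the table; the two list-returning methods are omitted because their return types are not shown here.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class HypotheticalModule(ABC):
    """Stand-in for an abstract class from abstract.py (all names are assumptions)."""

    @staticmethod
    @abstractmethod
    def prepare() -> None:
        """One-time setup, run when the pipeline is first deployed."""

    @staticmethod
    @abstractmethod
    def load_resources() -> Dict[str, Any]:
        """Load data shared by all records and return it as a dictionary."""

    @staticmethod
    @abstractmethod
    def process(record_text: str, **resources: Any) -> Dict[str, Any]:
        """Process a single record and return the results as a dictionary."""


class WordCountModule(HypotheticalModule):
    """Toy module: counts the non-stopword tokens of a record."""

    @staticmethod
    def prepare() -> None:
        # Nothing to download or build for this toy example.
        return None

    @staticmethod
    def load_resources() -> Dict[str, Any]:
        # Shared, read-only data that is identical for every record.
        return {"stopwords": frozenset({"the", "a", "of"})}

    @staticmethod
    def process(record_text: str, **resources: Any) -> Dict[str, Any]:
        stopwords = resources["stopwords"]
        tokens = [t for t in record_text.lower().split() if t not in stopwords]
        return {"word_count": len(tokens)}


# Usage: load the shared resources once, then reuse them for each record.
resources = WordCountModule.load_resources()
result = WordCountModule.process("The history of a pipeline", **resources)
```

Keeping the methods static means a module holds no per-instance state: everything a record needs is either passed in or comes from the dictionary of shared resources, which is loaded once rather than per record.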
Add your module#
After implementing your module using one of the abstract classes in abstract.py, you can add it to the pipeline by:
- Adding the MODULE.py to the modules directory
- Putting the name of the module file (without .py) into modules.yaml
These steps are illustrated for links.py and collection_indices.
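The first step, and the name needed for the second, can be sketched in a few lines of Python. This is an illustration under assumptions: the helper `stage_module`, the file name `my_module.py`, and the directory paths are hypothetical, and the sketch deliberately does not write to modules.yaml because its layout is not described here. It only returns the name you would enter there.

```python
import shutil
import tempfile
from pathlib import Path


def stage_module(module_file: Path, modules_dir: Path) -> str:
    """Copy module_file into modules_dir and return the name to list in modules.yaml."""
    if module_file.suffix != ".py":
        raise ValueError(f"expected a .py file, got {module_file.name}")
    modules_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(module_file, modules_dir / module_file.name)
    # The modules.yaml entry is the file name without the .py extension.
    return module_file.stem


# Demonstration in a temporary directory (paths are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "my_module.py"
    source.write_text("# module code goes here\n")
    modules_dir = Path(tmp) / "modules"
    entry_name = stage_module(source, modules_dir)
    copied = (modules_dir / "my_module.py").exists()
```

Here `entry_name` holds `my_module`, the value to add to modules.yaml by hand.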