The Crawling Frontier

Drawing inspiration from the URL Frontier of the Crawler Commons project, we envision the Crawling Frontier (OWLer-Frontier) as a specialized extension, tailored to cater to the unique needs of the OpenWebSearch.eu project.

[Image: ../../_images/crawler-frontier-overview.png]

Fig. 7: Role of the frontier in coordinating different crawlers

The URL Frontier is a crawler component which tracks the status of both crawled and to-be-crawled URLs. Within the OWLer crawling system, it takes the central position in a star-shaped architecture. The term Frontier originally describes the data structure for storing the URLs during a crawl; in our case, they are stored in an OpenSearch index. We have therefore built our OpenSearch-based implementation upon the existing URLFrontier framework, which has been funded by NGI Zero. This open-source software project defines an API for facilitating communication between the frontier services and the crawler nodes. The StormCrawler framework natively supports the URLFrontier programming interface and is able to retrieve and upload URLs from the Frontier with the API's GetURLs and PutURLs commands, respectively.
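
The contract above can be sketched in a few lines. The following is a hypothetical, in-memory Python model of the GetURLs/PutURLs exchange, not the real gRPC API: the class name, method names, and the one-URL-per-host politeness rule are illustrative assumptions, chosen only to show how a frontier separates to-be-crawled from crawled URLs.

```python
# Hypothetical in-memory sketch of the URLFrontier contract.
# The real API is a gRPC service; all names here are illustrative.
from collections import defaultdict, deque
from urllib.parse import urlparse

class FrontierSketch:
    """Holds per-host URL queues, as a frontier backend index would."""

    def __init__(self):
        self.queues = defaultdict(deque)  # host -> to-be-crawled URLs
        self.crawled = set()              # URLs already handed out

    def put_urls(self, urls):
        # Analogous to PutURLs: crawlers report discovered URLs.
        for url in urls:
            if url not in self.crawled:
                self.queues[urlparse(url).hostname].append(url)

    def get_urls(self, max_per_queue=1):
        # Analogous to GetURLs: hand out a batch, limited to
        # `max_per_queue` URLs per host for politeness.
        batch = []
        for host, queue in self.queues.items():
            for _ in range(min(max_per_queue, len(queue))):
                url = queue.popleft()
                self.crawled.add(url)
                batch.append(url)
        return batch

frontier = FrontierSketch()
frontier.put_urls(["https://example.org/a", "https://example.org/b",
                   "https://example.com/x"])
batch = frontier.get_urls(max_per_queue=1)
# The batch contains at most one URL per host, however long each queue is.
```

A crawler node would loop on `get_urls`, fetch the pages, and feed newly discovered links back via `put_urls`, which is the cycle StormCrawler performs against the real Frontier.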

The figure above shows an overview of the collaborative crawling architecture, consisting of the OpenSearch backend, several frontier services, and peer-to-peer crawling nodes. The OpenSearch instance, which hosts the Frontier index, is shared among the frontier services. This system design has proven valuable in maintaining a single, collaborative crawl. Each frontier service is connected to exactly one crawler. Due to the lightweight nature of the frontier services, it is easy to start new ones to serve as endpoints for future crawlers. Hence, the crawl can be scaled out horizontally without interruption, and new resources can be added in a plug-and-play manner.
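
To make the one-backend/many-services design concrete, here is a small illustrative sketch (not OWLer code): a single shared store stands in for the OpenSearch index, several lightweight service objects front it, and each crawler binds to exactly one service, so a new crawler/service pair can join without restarting the crawl. All class and method names are assumptions made for this example.

```python
# Illustrative sketch of the star-shaped architecture (not OWLer code):
# many lightweight frontier services share one backend store.
class SharedBackend:
    """Stands in for the shared OpenSearch index holding the frontier."""

    def __init__(self, urls):
        self.pending = list(urls)

    def fetch_batch(self, n):
        # Hand out the next n pending URLs and drop them from the store.
        batch, self.pending = self.pending[:n], self.pending[n:]
        return batch

class FrontierService:
    """Lightweight endpoint; every instance reads the same backend."""

    def __init__(self, backend):
        self.backend = backend

    def get_urls(self, n=2):
        return self.backend.fetch_batch(n)

backend = SharedBackend([f"https://example.org/page{i}" for i in range(6)])
service_a = FrontierService(backend)  # endpoint for crawler A
service_b = FrontierService(backend)  # endpoint for crawler B, added later

batch_a = service_a.get_urls(2)
batch_b = service_b.get_urls(2)
# The two crawlers receive disjoint batches from the single shared crawl.
```

Because the services hold no crawl state of their own, starting `service_b` mid-crawl neither interrupts `service_a` nor duplicates its work, which is the plug-and-play property described above.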

The adoption of URLFrontier was motivated by the nature of our crawling system, which is heterogeneous, highly distributed, and open to independent crawling efforts. Within the OWS project, crawlers are located at computing sites across Europe and can join or leave arbitrarily. The performance of the frontier services is therefore particularly crucial and must not become a bottleneck for the crawling activities. The Frontier thus functions as the central storage of URLs and enables peer-to-peer crawling despite the large geographic distances between the crawlers. In this context, peer-to-peer means that several workers collaborate on the same crawl, which is technically possible because the crawl space is partitioned among the frontier services. Our future contribution to the open-source community will be an enhanced implementation of this partitioning method. To equip our crawling system for various use cases, we have replaced the even, hash-based partitioning of the crawl space with dynamic partitioning. This allows crawlers to define their specific scope of interest and retrieve only URLs within that focus. Overall, this modification facilitates running different crawling pipelines, such as explorative crawling and Sitemap crawling, in a collaborative manner.
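
The difference between the two partitioning schemes can be sketched as follows. This is a minimal illustration under assumed rules, not OWLer's actual implementation: hash-based partitioning assigns every URL to a fixed bucket derived from its host, while the "dynamic" variant is modeled here as a scope declared by each crawler (a host-suffix rule, an assumption made for this example) that filters which URLs it receives.

```python
# Sketch contrasting even hash-based partitioning with a scope-based
# ("dynamic") assignment; the rules and scope names are illustrative.
import hashlib
from urllib.parse import urlparse

def hash_partition(url, num_partitions):
    # Even split: every URL lands in a fixed partition derived from its
    # host, regardless of any crawler's interests.
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def scope_partition(url, scopes):
    # Dynamic split: each crawler declares a scope of interest (modeled
    # here as a host suffix) and only receives URLs matching it.
    host = urlparse(url).hostname or ""
    for name, suffix in scopes.items():
        if host.endswith(suffix):
            return name
    return None  # outside every declared scope

urls = ["https://crawler.example.de/", "https://example.org/sitemap.xml"]
scopes = {"german-web": ".de", "sitemap-crawl": ".org"}

even = [hash_partition(u, 3) for u in urls]          # fixed 0..2 buckets
scoped = [scope_partition(u, scopes) for u in urls]  # interest-driven
```

Under the hash scheme, a Sitemap-focused crawler would still be handed arbitrary URLs from its bucket; under the scope scheme, it receives only URLs matching its declared focus, which is what lets explorative and Sitemap pipelines share one crawl.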

The OWS Frontier also serves as a central point for monitoring the operational status of our services. We collect API request statistics via a Grafana instance, as shown in the screenshot below, which lets us observe the operational status of all participating crawling sites in a single dashboard. Note that the Grafana instance monitors the operational metrics of our services, while the ows-metrics service monitors the content-based metrics.
