Crawling#
We developed the Open Web Crawler (OWLer), which is a configuration and extension of StormCrawler.
Crawling takes place on the following two cluster tiers:
The Crawling Cluster Tier (CCT) contains a crawler responsible for crawling the Web, thereby creating WARC files (i.e., collections of HTTP streams from the crawling process). These WARC files are stored in an S3 bucket that can be accessed by the other tiers. There can be multiple CCTs, but there should be only one CCT per data center, as a CCT can scale with the amount of resources provided.
The Crawler Frontier Tier (CFT) is a singleton tier (i.e., only one instance exists across all participating data centers). It coordinates the multi-tier crawling process, keeping track of crawled URLs, access statistics and cache digest. The crawler frontier is the primary data source for the website registry, which provides the interface for interaction between webmasters and the frontier (e.g., an interface for take-down requests).
At each individual data center, the CCT consists of OWLer (the OWS.eu crawler), a distributed web crawler built with StormCrawler running on Apache Storm. Aside from regular crawling functionality, OWLer also includes a classifier for distinguishing between benign, malicious and adult content.
The CFT consists of a single OpenSearch instance running at one of the data centers. It also contains a metrics server which aggregates logs from different stages and allows for (limited) statistical analysis.
WARC Files - Web Archive Files#
The crawler stores data in the WARC file format (ISO 28500:2017). In essence, a WARC file captures the full HTTP request/response stream and stores it as a record. Additional metadata can optionally be added.
Below you can find an example of a WARC entry for the first webpage.
::::{dropdown} WARC File Format
:::{dropdown} WARC Entry HTTP Request
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2023-11-10T14:47:02Z
WARC-Record-ID: <urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC-Filename: crawl.warc.gz
WARC-Block-Digest: sha1:MAOT3SNQXORCVFADNQMKR6R36KZSMFD3
Content-Length: 284
software: Wget/1.21.2 (linux-gnu)
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
robots: classic
wget-arguments: "--input-file" "urls.txt" "--level" "2" "--delete-after" "--no-directories" "--warc-file" "data/crawl"
WARC/1.0
WARC-Type: request
WARC-Target-URI: <http://info.cern.ch/hypertext/WWW/TheProject.html>
Content-Type: application/http;msgtype=request
WARC-Date: 2023-11-10T14:47:02Z
WARC-Record-ID: <urn:uuid:322fbe7f-5ed9-4f11-b03c-0a935479ebb6>
WARC-IP-Address: 188.184.100.182
WARC-Warcinfo-ID: <urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC-Block-Digest: sha1:OQS3GN6DUYZYKNEZUB3C43BPKBUKFJEO
Content-Length: 156
GET /hypertext/WWW/TheProject.html HTTP/1.1
Host: info.cern.ch
User-Agent: Wget/1.21.2
Accept: */*
Accept-Encoding: identity
Connection: Keep-Alive
:::
:::{dropdown} WARC Entry HTTP Response
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:ad958233-e57a-4830-a943-6fcbd98b6c02>
WARC-Warcinfo-ID: <urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC-Concurrent-To: <urn:uuid:322fbe7f-5ed9-4f11-b03c-0a935479ebb6>
WARC-Target-URI: <http://info.cern.ch/hypertext/WWW/TheProject.html>
WARC-Date: 2023-11-10T14:47:02Z
WARC-IP-Address: 188.184.100.182
WARC-Block-Digest: sha1:IB2ELXU6UBK2NVUUQDJACRGMAN6CUQEG
WARC-Payload-Digest: sha1:PFXVPF7CXJ2AW372EVGRLQ3OLXX24HJT
Content-Type: application/http;msgtype=response
Content-Length: 2450
HTTP/1.1 200 OK
Date: Fri, 10 Nov 2023 14:47:02 GMT
Server: Apache
Last-Modified: Thu, 03 Dec 1992 08:37:20 GMT
ETag: "8a9-291e721905000"
Accept-Ranges: bytes
Content-Length: 2217
Connection: close
Content-Type: text/html
<HEADER>
<TITLE>The World Wide Web project</TITLE>
<NEXTID N="55">
</HEADER>
<BODY>
<H1>World Wide Web</H1>The WorldWideWeb (W3) is a wide-area<A
NAME=0 HREF="WhatIs.html">
hypermedia</A> information retrieval
initiative aiming to give universal
access to a large universe of documents.<P>
Everything there is online about
W3 is linked directly or indirectly
to this document, including an <A
NAME=24 HREF="Summary.html">executive
summary</A> of the project, <A
NAME=29 HREF="Administration/Mailing/Overview.html">Mailing lists</A>
, <A
NAME=30 HREF="Policy.html">Policy</A> , November's <A
NAME=34 HREF="News/9211.html">W3 news</A> ,
<A
NAME=41 HREF="FAQ/List.html">Frequently Asked Questions</A> .
<DL>
<DT><A
NAME=44 HREF="../DataSources/Top.html">What's out there?</A>
<DD> Pointers to the
world's online information,<A
NAME=45 HREF="../DataSources/bySubject/Overview.html"> subjects</A>
, <A
NAME=z54 HREF="../DataSources/WWW/Servers.html">W3 servers</A>, etc.
<DT><A
NAME=46 HREF="Help.html">Help</A>
<DD> on the browser you are using
<DT><A
NAME=13 HREF="Status.html">Software Products</A>
<DD> A list of W3 project
components and their current state.
(e.g. <A
NAME=27 HREF="LineMode/Browser.html">Line Mode</A> ,X11 <A
NAME=35 HREF="Status.html#35">Viola</A> , <A
NAME=26 HREF="NeXT/WorldWideWeb.html">NeXTStep</A>
, <A
NAME=25 HREF="Daemon/Overview.html">Servers</A> , <A
NAME=51 HREF="Tools/Overview.html">Tools</A> ,<A
NAME=53 HREF="MailRobot/Overview.html"> Mail robot</A> ,<A
NAME=52 HREF="Status.html#57">
Library</A> )
<DT><A
NAME=47 HREF="Technical.html">Technical</A>
<DD> Details of protocols, formats,
program internals etc
<DT><A
NAME=40 HREF="Bibliography.html">Bibliography</A>
<DD> Paper documentation
on W3 and references.
<DT><A
NAME=14 HREF="People.html">People</A>
<DD> A list of some people involved
in the project.
<DT><A
NAME=15 HREF="History.html">History</A>
<DD> A summary of the history
of the project.
<DT><A
NAME=37 HREF="Helping.html">How can I help</A> ?
<DD> If you would like
to support the web..
<DT><A
NAME=48 HREF="../README.html">Getting code</A>
<DD> Getting the code by<A
NAME=49 HREF="LineMode/Defaults/Distribution.html">
anonymous FTP</A> , etc.</A>
</DL>
</BODY>
:::
:::{dropdown} WARC Entry Metadata plus resources
WARC/1.0
WARC-Type: metadata
WARC-Record-ID: <urn:uuid:32d10e12-00e6-4b4b-98f3-530dbc839ef5>
WARC-Warcinfo-ID: <urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC-Target-URI: <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
WARC-Date: 2023-11-10T14:47:02Z
WARC-Block-Digest: sha1:SFDEOQVUCCEKMJ7UMM6YL4CPNJKGYS72
Content-Type: text/plain
Content-Length: 48
<urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:4109d20f-274f-4b53-a303-7a564f76529b>
WARC-Warcinfo-ID: <urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC-Concurrent-To: <urn:uuid:32d10e12-00e6-4b4b-98f3-530dbc839ef5>
WARC-Target-URI: <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
WARC-Date: 2023-11-10T14:47:02Z
WARC-Block-Digest: sha1:DW4RWCZI622AZOZBAZ3NOZIF6FM57LKE
Content-Type: text/plain
Content-Length: 104
"--input-file" "urls.txt" "--level" "2" "--delete-after" "--no-directories" "--warc-file" "data/crawl"
WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:664641c1-dc9c-439b-9855-393448228afa>
WARC-Warcinfo-ID: <urn:uuid:e965fba4-5ea9-432a-bd6c-a0b9c2cea709>
WARC-Concurrent-To: <urn:uuid:32d10e12-00e6-4b4b-98f3-530dbc839ef5>
WARC-Target-URI: <metadata://gnu.org/software/wget/warc/wget.log>
WARC-Date: 2023-11-10T14:47:02Z
WARC-Block-Digest: sha1:PJRVSYREY5FZ22ERXZGXJW4AF7JNCUOW
Content-Type: text/plain
Content-Length: 687
Opening WARC file ‘data/crawl.warc.gz’.
--2023-11-10 15:47:02-- http://info.cern.ch/hypertext/WWW/TheProject.html
Resolving info.cern.ch (info.cern.ch)... 188.184.100.182, 2001:1458:d00:35::100:222
Connecting to info.cern.ch (info.cern.ch)|188.184.100.182|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2217 (2,2K) [text/html]
Saving to: ‘TheProject.html.tmp’
0K .. 100% 213M=0s
2023-11-10 15:47:02 (213 MB/s) - ‘TheProject.html.tmp’ saved [2217/2217]
Removing TheProject.html.tmp.
FINISHED --2023-11-10 15:47:02--
Total wall clock time: 0,08s
Downloaded: 1 files, 2,2K in 0s (213 MB/s)
:::
::::
Current Status#
Currently (end of July 2023), the CCT has been deployed successfully at three partner sites (UNI PASSAU, BADW-LRZ and IT4I@VSB) and is in operation at two (UNI PASSAU and BADW-LRZ). The CFT is deployed at CERN. At the time of writing, the CCT crawls roughly 200 GB of content per day; it can be scaled up by adding machines, and we aim to reach a consistent 1 TB/day by the end of the year. This also requires resolving security-related issues, such as false-positive network security alarms raised when accessing botnets. However, the frontier remains a bottleneck that must be scaled up in line with the number of machines used.
The crawling process will be made transparent via the [OWLer Dashboard](TBA SOON), which also facilitates on-demand crawling services.
OWLer Implementation Details#
Installation#
The OWLer Toolkit provides the means (e.g., Ansible scripts) for installing OWLer.
From StormCrawler to OWLer#
StormCrawler is an open-source software project for web crawling that is built on Apache Storm. Apache Storm, designed for real-time processing of large volumes of data, is scalable, robust, and fault-tolerant, making it ideal for web crawling tasks. StormCrawler extends this functionality, providing a collection of resources and tools to implement and deploy a scalable web crawler. It allows the extraction of a vast array of data and metadata, supports various protocols, and can be customized to handle unique use cases. StormCrawler’s architecture involves the concepts of “spouts” and “bolts.” Spouts are the data sources or streams, while bolts process the input data. Together, they form a “topology,” a network of spouts and bolts working in concert to complete a task. A critical feature is its ability to scale linearly, permitting a seamless increase in crawling speed based on the resources available.
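To make the spout/bolt terminology concrete, the sketch below wires a placeholder spout and bolt into a topology using Storm's core API (written against Storm 2.x). The class names `SeedSpout` and `LogBolt` are illustrative stand-ins, not OWLer's actual components; StormCrawler ships its own stock spouts and bolts (fetchers, parsers, indexers) that are wired in the same way.

```java
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MiniCrawlTopology {

    /** Placeholder spout: emits a fixed seed URL (a real crawler would read from the frontier). */
    public static class SeedSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Queue<String> seeds;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.seeds = new LinkedList<>();
            seeds.add("http://info.cern.ch/hypertext/WWW/TheProject.html");
        }

        @Override
        public void nextTuple() {
            String url = seeds.poll();
            if (url != null) {
                collector.emit(new Values(url));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url"));
        }
    }

    /** Placeholder bolt: stands in for fetching/parsing; here it only logs the URL it receives. */
    public static class LogBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("processing " + input.getStringByField("url"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: emits nothing downstream
        }
    }

    public static void main(String[] args) throws Exception {
        // A topology is a graph of spouts (sources) and bolts (processors).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("seeds", new SeedSpout(), 1);
        // group tuples by the "url" field so the same URL always reaches the same bolt task
        builder.setBolt("process", new LogBolt(), 2).fieldsGrouping("seeds", new Fields("url"));

        Config conf = new Config();
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("mini-crawl", conf, builder.createTopology());
            Thread.sleep(5000); // let the local cluster run briefly before shutting down
        }
    }
}
```

In a real StormCrawler topology the `LogBolt` would be replaced by a chain of fetching, parsing and indexing bolts, and the spout would pull URLs from the frontier instead of a fixed seed list.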
Building on the power and flexibility of StormCrawler, we developed the Open Web Crawler (OWLer). OWLer is an enhancement and a tailored adaptation designed to accommodate the specific needs of the OpenWebSearch.eu project. It inherits the robust and scalable nature of its progenitor and further extends the topologies with additional spouts and bolts to fit the demands of the project. To manage and optimize the crawling process, we have fine-tuned these topologies based on our requirements, creating a system that balances efficiency with respect for the websites we crawl.
Crawling Pipelines#
In OWLer, we utilize the inherent flexibility of StormCrawler’s topologies to create bespoke spouts and bolts.
The efficiency of the Open Web Crawler hinges on core pipeline elements that enable proficient web crawling. Each pipeline conducts HTML parsing, hyperlink extraction, optional URL filtering, and caching checks via HTTP to ensure that the crawled page is not already a duplicate within our system. It also performs content-based hashing to assign hash-based identifiers to URLs, and generates status metrics and WARC files. Communication with the URL Frontier is enabled through a designated API, which fills the crawling pipeline, reports newly unearthed URLs, and refreshes the status of known URLs. To manage potential security or licensing breaches more effectively, we have instituted a risk-based categorization of crawling pipelines; the three conceptual pipelines described below have been developed.
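Before turning to the individual pipelines, the following minimal sketch illustrates the content-based hashing step mentioned above. The choice of SHA-256 over the URL plus the raw payload is an assumption made for illustration; OWLer's actual identifier scheme is not spelled out here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat; // Java 17+

public class ContentHash {

    /**
     * Derives a hash-based identifier for a crawled record.
     * Assumption: SHA-256 over the URL and the raw payload; OWLer's exact scheme may differ.
     */
    public static String recordId(String url, byte[] payload) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(url.getBytes(StandardCharsets.UTF_8));
        digest.update(payload);
        return HexFormat.of().formatHex(digest.digest());
    }

    public static void main(String[] args) throws Exception {
        byte[] body = "<html><title>The World Wide Web project</title></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(recordId("http://info.cern.ch/hypertext/WWW/TheProject.html", body));
    }
}
```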
WARC2WARC Pipeline#
This pipeline assimilates externally provided WARC files into the index and populates the frontier with identified links. This allows us to continuously integrate crawling results from external crawlers, most prominently Common Crawl, into our pipelines. Functionally, it retrieves WARC files from S3-compliant storage, parses and extracts the embedded content and relevant hyperlinks, submits the extracted URLs to the frontier, and archives the WARC entries without modification. This non-fetching crawling pipeline hinges solely on already obtained web content, thus eliminating direct risks associated with unauthorized or impolite crawling. For data centers and high-performance computing centers, this crawling mode also eliminates potential security impacts and alarms (e.g. due to accessing bot sites while crawling) as well as complaints from webmasters.
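As a rough, self-contained illustration of the non-fetching idea (leaving out S3 retrieval and frontier submission), the sketch below reads a gzipped WARC file and pulls hyperlinks out of the stored payloads with a simple regular expression. It uses only the JDK; the real pipeline relies on StormCrawler's WARC handling and a proper HTML parser, so treat this as a conceptual sketch rather than the production code path.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

public class WarcLinkExtractor {

    // naive href extraction; a production pipeline would use a real HTML parser
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String warcPath = args.length > 0 ? args[0] : "crawl.warc.gz"; // example path
        Set<String> links = new LinkedHashSet<>();

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(warcPath)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    links.add(m.group(1)); // in OWLer these would be submitted to the URL frontier
                }
            }
        }
        links.forEach(System.out::println);
    }
}
```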
Regular Crawling Pipeline#
Designed for explorative crawling, the regular crawling pipeline consumes URLs from the frontier, follows the general pipeline elements for URL crawling, appends the extracted URLs to the frontier, and archives new pages. Given its unrestricted scope, it carries the highest risk of security and licensing infractions. Some risks, such as inadvertent interaction with a botnet, are inevitable in the context of explorative crawling. This crawling mode can also face requests and/or complaints from webmasters. While we have developed a corresponding web page to inform webmasters about our crawling activity and how to control our crawler (see https://ows.eu/owler/ or the Appendix), exploratory crawling can still trigger requests from webmasters (currently handled via e-mail). Thus, this crawling mode needs stricter governance.
Sitemap Crawling Pipeline#
Unlike the regular pipeline, which discovers new webpages traditionally via hyperlink extraction, this pipeline adopts an alternative link-discovery method: Sitemaps. These are crawler-friendly tools used by webmasters to expedite robot access to website content. The crawler retrieves and parses Sitemaps as well as the webpages discovered through them. Sitemaps also specify changes on a website and allow content to be refreshed in an efficient and timely manner. This pipeline has a lower risk profile, as the retrieved webpages are anticipated to be crawled based on the Sitemap mechanism.
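The sketch below illustrates Sitemap-based discovery: it downloads a sitemap.xml and prints its `<loc>` entries using the JDK's DOM parser. The sitemap URL is a placeholder, and OWLer itself builds on StormCrawler's Sitemap support rather than standalone code like this.

```java
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapFetcher {

    public static void main(String[] args) throws Exception {
        // example sitemap location; real crawls discover it via robots.txt or convention
        URL sitemapUrl = new URL("https://example.org/sitemap.xml");

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        try (InputStream in = sitemapUrl.openStream()) {
            Document doc = builder.parse(in);
            // <loc> elements hold the URLs the webmaster wants crawlers to visit
            NodeList locs = doc.getElementsByTagNameNS("*", "loc");
            for (int i = 0; i < locs.getLength(); i++) {
                System.out.println(locs.item(i).getTextContent().trim());
            }
        }
    }
}
```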
Logging and Tracking#
In our aim to provide transparency and maintain accountability, we have moved our log files to a central ows-metric service hosted by CERN. The service tracks the progress and monitors the activities of our crawling operations, so the logs can be accessed there. For security reasons, access is IP-controlled. This transparency and accessibility ensure that we can maintain a high standard of accountability and continuous improvement in our processes. We have developed dashboards to gain insights into the crawling behaviour and the crawled websites. The following screenshot shows the dashboards on page fetches and crawled URLs. The second dashboard allows for a (limited) content-based analysis of the current crawling status, e.g. by analysing HTTP status codes and statistics on the domains crawled.
Pipeline Risk Assessment#
From the viewpoint of an infrastructure provider, crawling can be associated with a number of risks and potential requests from webmasters or content owners. Consequently, assessing the potential risks, costs and efforts per pipeline becomes essential.
The following figure summarizes the crawling pipelines and their associated risk levels.

Fig. 6 Crawling pipelines and associated risk levels, where a higher risk level means a higher probability of interaction from the crawled websites.#
Below we list some details on potential issues arising from running broad crawls.
Institutional Licenses#
When running crawlers from within a complex organisation, such as a university, institutional licenses can present a potential stumbling block. Issues arise because a crawler might operate under an IP address that can access licensed content (e.g., PDFs from digital libraries licensed to a university). Unauthorized PDF downloads, for instance, could lead to license infringements. Therefore, potential IP-based access to licensed content needs to be assessed before running a crawler.
Botnets#
Another issue is accessing botnets through a crawler. While OWLer maintains a frontier with information on botnets, access to such servers cannot be ruled out with a 100% guarantee. Accessing botnets might raise security alarms within network departments and organisations. Although these alarms can be considered false positives, since the crawlers do not execute any JavaScript code, handling the alarms is still required. Also, crawling machines need to be secured as well as possible, as the crawler's IP address becomes known to the attacker.
Hint
Overall, we suggest using separate IP ranges for crawlers to mitigate the costs associated with crawling botnets or with institutional licenses.
The Politeness of Crawling#
It is crucial for our operations to maintain a level of politeness while crawling. We ensure this by implementing a delay between fetches from the same domain/host. Despite our stringent policies, we have encountered situations where one of our crawlers was IP-blacklisted by a domain. One problem here is that acceptable delays are not standardized across websites, which requires a more fine-grained level of control, as planned for the open web console. The standard configuration of our crawler operates at one request per second per domain. However, for certain sites this rate appears impolite. To counteract this, we have increased the time window per domain to 30 seconds.
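In StormCrawler-based crawlers such politeness delays are a configuration matter. The snippet below is a hedged sketch that sets a per-server fetch delay through Storm's `Config` map; the key name `fetcher.server.delay` follows StormCrawler's reference configuration but should be verified against the deployed OWLer version, and in practice such settings usually live in the crawler's YAML configuration rather than in code.

```java
import org.apache.storm.Config;

public class PolitenessConfig {

    /** Builds a Storm configuration with a conservative per-host fetch delay. */
    public static Config politeConfig(double delaySeconds) {
        Config conf = new Config();
        // delay between successive requests to the same host/domain (seconds);
        // key name follows StormCrawler's reference config and may differ per version
        conf.put("fetcher.server.delay", delaySeconds);
        return conf;
    }

    public static void main(String[] args) {
        // default politeness is roughly one request per second; raised to 30s for sensitive domains
        Config conf = politeConfig(30.0);
        System.out.println(conf.get("fetcher.server.delay"));
    }
}
```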