Infrastructure and Estimated Resource Requirements
The majority of our planned computing tasks fall into the category of big data processing, i.e., large datasets in distributed storage with small, independent, and highly parallel computing tasks on chunks of data. The prime computing paradigm is MapReduce as realized in open-source frameworks like Hadoop, Apache Flink, or Apache Spark, where some of the more sophisticated enrichment and analytics models would require GPU support.
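To illustrate the MapReduce style of processing envisaged here, the following minimal PySpark sketch counts term frequencies over a chunked web-page corpus. The input path and the record layout (one plain-text page per line) are illustrative assumptions, not part of the actual pipeline.

```python
# Minimal PySpark sketch of a MapReduce-style job over chunked web-page text.
# The HDFS paths and the record layout (one plain-text page per line) are
# illustrative assumptions, not the project's actual data format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("owi-term-counts").getOrCreate()
sc = spark.sparkContext

pages = sc.textFile("hdfs:///data/openweb/pages/*.txt")  # hypothetical input path
term_counts = (
    pages.flatMap(lambda page: page.lower().split())  # map: split each page into terms
         .map(lambda term: (term, 1))                  # pair each term with a count of 1
         .reduceByKey(lambda a, b: a + b)              # reduce: sum counts per term
)

term_counts.saveAsTextFile("hdfs:///data/openweb/term-counts")  # hypothetical output path
spark.stop()
```

Such jobs parallelize naturally over chunks of the corpus, which is why commodity CPU nodes cover most of the workload and GPUs are only needed for the heavier enrichment and analytics models.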
The second important ingredient is an object storage of several petabytes for big data processing. Resource requirements for our infrastructure can be estimated from previous web indexing efforts and projects. Major search engines index around $60 \cdot 10^9$ webpages, with a fairly moderate growth rate (which might change with the advances of large language models). In previous work, the WEBIS Group developed ChatNoir, a search engine for research purposes, which indexes open web crawls (i.e., ClueWeb and Common Crawl) with around $3 \cdot 10^9$ webpages, i.e., 5% of commercial indexes of the Web. Assuming we consider only text and extrapolate ChatNoir's storage numbers [BSHP18], we end up with the following estimation for around 40% of the text web:
| Entities / Components | Technical Specification |
|---|---|
| Estimated storage for raw data (replicated 3 times) | 1500 TiB |
| Estimated size of the Open Web Index (replicated 3 times) | 500 TiB (fast access) |
| Estimated demand for temporary storage for intermediate results | 1000 TiB |
| Node requirements for storage and analytics computations | 25 nodes, each with 96 cores & 256 GiB RAM |
| Node requirements for serving the index | 70 nodes, each with 48 cores & 256 GiB RAM |
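The scaling arithmetic behind these figures can be sketched roughly as follows; the per-corpus raw-text footprint of ChatNoir used below is an illustrative placeholder, not the exact number reported in [BSHP18].

```python
# Back-of-the-envelope extrapolation from ChatNoir to ~40% of the text web.
# CHATNOIR_RAW_TIB is an illustrative placeholder, not the exact figure
# from [BSHP18].
CHATNOIR_PAGES = 3e9         # pages indexed by ChatNoir
TARGET_PAGES = 0.40 * 60e9   # ~40% of the ~60*10^9 pages in commercial indexes
CHATNOIR_RAW_TIB = 62.0      # assumed raw-text footprint of ChatNoir (TiB)
REPLICATION = 3              # replication factor of the distributed storage

scale = TARGET_PAGES / CHATNOIR_PAGES              # = 8x the ChatNoir corpus
raw_storage = scale * CHATNOIR_RAW_TIB * REPLICATION
print(f"Estimated raw storage: {raw_storage:.0f} TiB")  # ~1500 TiB
```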
These infrastructure requirements will be addressed collaboratively through access to shared-usage and dedicated systems at partner sites, whereby the distributed setup across sites adds a further challenge.