Methodology
Methodologically, the project combines conceptual considerations, algorithmic innovations, and technology development. We structure our research along the data processing chain for search and discovery systems, taking different stakeholders and services into account, as shown in the following figure.
The envisioned system comprises several integrated components designed to optimize web data search and discovery:
Crawling and Management: Coordinates crawling and supports the collection of information from website owners and content creators. The result is an Open Website Index that includes metadata such as categorization, topics, legal status, and license information. This index streamlines the crawling process, maps the web for public access, and flags potential URL issues. It also manages web crawl results, whether produced internally or obtained from external sources.
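As a rough illustration of the kind of metadata such an index entry could carry, the following Python sketch defines a hypothetical record type; the class and field names (WebsiteRecord, legal_status, license_id, crawl_ok) are illustrative assumptions rather than the project's actual schema.

```python
# A minimal sketch of what a single Open Website Index entry might hold;
# all names are illustrative assumptions, not the project's actual schema.
from dataclasses import dataclass, field


@dataclass
class WebsiteRecord:
    """One website as seen by a (hypothetical) Open Website Index."""
    url: str
    topics: list[str] = field(default_factory=list)  # e.g. ["science", "news"]
    category: str = "uncategorized"                   # coarse content categorization
    legal_status: str = "unknown"                     # e.g. crawl permitted / restricted
    license_id: str | None = None                     # e.g. an SPDX identifier
    crawl_ok: bool = True                             # flag for known URL issues


# Example: registering one site and flagging a problematic one,
# then deriving a crawl queue from the index.
index = [
    WebsiteRecord("https://example.org", topics=["science"], license_id="CC-BY-4.0"),
    WebsiteRecord("https://broken.example", crawl_ok=False),
]
crawl_queue = [rec.url for rec in index if rec.crawl_ok]
print(crawl_queue)
```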
Preprocessing and Enrichment: Offers tools for cleaning and refining web data obtained from crawls. The system processes web pages, datasets, code, and other objects, and provides semantic enrichment, personal data removal, and quality assessments. Users can thus create specialized, high-quality web data sets.
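A minimal sketch of such a preprocessing chain is shown below, assuming a simple clean → remove personal data → enrich sequence; the stage functions and the naive regex-based removal of e-mail addresses are illustrative placeholders, not the project's actual tooling.

```python
# Illustrative preprocessing pipeline: clean -> remove personal data -> enrich.
# The stages and heuristics here are placeholders, not the project's tooling.
import re


def clean(text: str) -> str:
    """Normalize whitespace left over from HTML extraction."""
    return re.sub(r"\s+", " ", text).strip()


def remove_personal_data(text: str) -> str:
    """Very naive placeholder for PII removal (here: e-mail addresses only)."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[removed]", text)


def enrich(text: str) -> dict:
    """Attach toy semantic annotations and a crude quality score."""
    tokens = text.split()
    return {
        "text": text,
        "language": "en",                       # a real pipeline would detect this
        "quality": min(1.0, len(tokens) / 100)  # length-based proxy for quality
    }


def preprocess(raw_page: str) -> dict:
    """Run the full chain on one extracted page."""
    return enrich(remove_personal_data(clean(raw_page)))


print(preprocess("  Contact us at   info@example.org  for details.  "))
```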
Indexing and Search Architectures: Provides indexing services, generating full-text and metadata search indices for the refined web data. Various organizations or individuals can maintain separate indices, giving users a diverse choice of search engines. The system can combine these individual indices into one master index and distribute search queries in a federated manner. This balance of centralized and decentralized approaches enhances the user search experience.
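The federated part of this architecture can be pictured as fanning a query out to several independently maintained indices and merging the results, as in the following sketch; the in-memory dictionaries standing in for indices and the score-based merge are assumptions made purely for illustration.

```python
# Illustrative federated query distribution over several partial indices.
# The dict-based "indices" and the score-based merge are toy assumptions.

# Two hypothetical partial indices, e.g. maintained by different organizations.
INDEX_A = {"open web": [("https://a.example/open-web", 0.9)]}
INDEX_B = {"open web": [("https://b.example/owi", 0.8),
                        ("https://b.example/faq", 0.3)]}


def search_one(index: dict, query: str) -> list[tuple[str, float]]:
    """Answer a query from a single partial index."""
    return index.get(query, [])


def federated_search(query: str, indices: list[dict]) -> list[tuple[str, float]]:
    """Fan the query out to all partial indices and merge results by score."""
    hits = [hit for index in indices for hit in search_one(index, query)]
    return sorted(hits, key=lambda h: h[1], reverse=True)


print(federated_search("open web", [INDEX_A, INDEX_B]))
```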
Search Verticals and Applications: Creates specialized search applications, each designed for specific user needs. Beyond search engines, this includes data products, AI models, and other features. User-focused elements such as trust and transparency models are integrated, giving users insight into the search process. While the main system focuses on selected verticals, the wider community is invited to develop more, promoting a diverse search ecosystem.
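A vertical can be thought of as a thin layer over the generic search service that restricts results to one domain and explains why each hit was returned, as in the sketch below; the topic filter and the explanation field are hypothetical and only hint at how trust and transparency information might be surfaced.

```python
# Illustrative search vertical built on top of a generic index:
# restrict to one topic and attach a simple explanation per result.
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    score: float
    topic: str


def base_search(query: str) -> list[Hit]:
    """Stand-in for the underlying full-text search (the query is ignored here)."""
    return [
        Hit("https://news.example/article", 0.8, "news"),
        Hit("https://lab.example/dataset", 0.7, "science"),
    ]


def science_vertical(query: str) -> list[dict]:
    """A science-focused vertical: filter by topic and explain each result."""
    return [
        {"url": h.url, "score": h.score,
         "why": f"matched '{query}' in topic '{h.topic}'"}
        for h in base_search(query)
        if h.topic == "science"
    ]


print(science_vertical("climate data"))
```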
Data Storage and Infrastructure: Provides the essential distributed storage facilities that support the services above, together with FAIR data handling, sharing, and high-performance analysis. Rather than storing every web data set, it manages indices, supports collaboration across organizations, and ensures data provenance tracking, backed by infrastructure and models that address trust, privacy, and security.
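Provenance tracking for a derived data set could, for example, record each processing step together with a content hash so the lineage can be verified later, as sketched below; the record layout and field names are assumptions for illustration only.

```python
# Illustrative provenance record for a derived web data set: each processing
# step is appended with a content hash and timestamp. Layout is an assumption.
import hashlib
import json
from datetime import datetime, timezone


def content_hash(data: bytes) -> str:
    """Stable fingerprint of a data artifact."""
    return hashlib.sha256(data).hexdigest()


def add_provenance_step(record: dict, step: str, data: bytes) -> dict:
    """Append one processing step (name, hash, timestamp) to the lineage."""
    record.setdefault("lineage", []).append({
        "step": step,
        "sha256": content_hash(data),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return record


record = {"dataset": "example-web-subset"}
record = add_provenance_step(record, "crawl", b"<html>raw page</html>")
record = add_provenance_step(record, "preprocess", b"cleaned text")
print(json.dumps(record, indent=2))
```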