OWI Concepts & Definition
A web index is a data structure that allows fast content-based access, sorting, and filtering of large collections of web documents, and it forms the core of every web search engine today. Usually an inverted index structure is used, where content units (e.g. words, metadata) point to a list of the web documents they occur in. The quality of a web index stems from the quality of the indexed documents, augmented by additional signals such as usage information, metadata, or link structure, which allow search rankings to be fine-tuned to user needs.
The Open Web Index (OWI) (see also [GVF+23] for more details) aims to create an index of the Web and provide it in the form of open data. We build on the OWI concept from [Lew19]; however, rather than serving the index behind an API, we make it available as open data. It is comparable to Common Crawl and the Internet Archive, but with the additional goal of enriching web pages and indexing them.
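The inverted index idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the OWI implementation: the function names `build_inverted_index` and `search`, the whitespace tokenisation, and the example URLs are all assumptions made for this sketch.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation (assumption): split on whitespace only.
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query word (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical documents, keyed by made-up URLs.
docs = {
    "https://example.org/a": "open web index",
    "https://example.org/b": "open data for web search",
    "https://example.org/c": "search engine ranking",
}
index = build_inverted_index(docs)
hits = search(index, "web search")
```

Here `index["open"]` lists the two documents that contain the word, and `hits` contains only the document matching both query words. A production index would add ranking signals such as those mentioned above; this sketch only covers the lookup structure.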

Fig. 3 Illustration of the key concept of an OWI as the basis for an open, extensible, transparent and legally sound web search ecosystem. Colors depict different parts of the Web and contributions from different third parties in building a web index. Since the index is open data, it can be sliced and diced as required by the search engine provider, without the provider having to crawl or index the Web itself.
In order to advance the innovative idea of an OWI, we propose 6 principles for building an OWI and present a brief conceptual overview in the figure above:
Principle 1. Distributed as Open Data: Instead of solely providing an API over an openly created web index, we propose to view the index as open data, i.e. an openly created data structure available under open (use) licences. Consequently, the OWI can be sliced and diced as needed by a potential search engine provider, or be used for entirely different purposes, for example as training data for neural language models. A proper public licensing model can allow the acceptance of third-party contributions to the index or to parts of it, paving the way for collaborative index creation.
Principle 2. Open and Extensible Index Creation: Index creation should be transparent and extensible. All pipelines used—from crawling to pre-processing to indexing—have to be open source, and their configuration needs to be exposed openly. Extensibility empowers third parties to contribute (algorithmic) components to the index creation pipelines. Hence, the pipelines will contain up-to-date semantic enrichment models, and give researchers and innovators the opportunity to explore their methods at scale.
Principle 3. Collaborative Creation: Ideally, an OWI would be created collaboratively by different domain experts, similar to Wikipedia. Adding domain knowledge to the index should significantly increase its quality. Unlike in the case of Wikipedia, however, the underlying process is technically complex, which might require more technically oriented intermediaries (e.g. digital libraries or research computing centers). If the pipelines for index creation can be decentralised among independent computing centers, the costs of index creation could also be reduced significantly, yielding a cost-efficient, high-quality index.
Principle 4. Tracking Content Usage, not the User: An OWI must not collect data about individual users, even if this would be an important data source for optimizing search and retrieval processes. Assuming that the OWI is used in a lot of different search engines, one could collect aggregated information about the content usage in those engines, instead of collecting information from individual users. It would be up to the search engines to collect and aggregate click data for user groups. Such aggregated, anonymized usage data could be managed in addition to the OWI as a by-product, but not necessarily fully integrated into an index.
Principle 5. Control to the content owners: Content owners should be empowered to control the usage of their content in an OWI at a more fine-grained level than is possible with current approaches such as the de facto robots.txt standard. This includes the provision of legal information, for example machine-readable content licensing, and, on the other side, compliance with jurisdictional requirements such as the GDPR. Similarly, through Principle 4, web content owners will be informed about how their content is used, opening up opportunities for new business models.
Principle 6. Legal compliance for content users: Due to different legal frameworks in different countries, legal uncertainties when crawling and pre-processing web data remain high, e.g. regarding intellectual property and licensing rights. The current gatekeepers hold a unique position such that content owners have to waive the rights to use their content in exchange for the possibility of being found. Providing an OWI requires processes that consider different legal frameworks and ensure the legal usage of content and the exclusion of illegal content (see e.g. [EGP21]).
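As a rough illustration of Principle 4, the following sketch shows how a search engine might reduce raw click events to per-group usage counts before sharing them. The event fields (`user`, `group`, `url`), the function name `aggregate_clicks`, and the example data are hypothetical and not part of any OWI specification. Note that counting alone does not guarantee anonymity for very small groups.

```python
from collections import Counter


def aggregate_clicks(click_events):
    """Count clicks per (user_group, url), discarding the user identifier."""
    counts = Counter()
    for event in click_events:
        # Only the group and the URL are kept; event["user"] is never stored.
        counts[(event["group"], event["url"])] += 1
    return dict(counts)


# Hypothetical raw events as a search engine might log them internally.
events = [
    {"user": "u1", "group": "de", "url": "https://example.org/a"},
    {"user": "u2", "group": "de", "url": "https://example.org/a"},
    {"user": "u1", "group": "de", "url": "https://example.org/b"},
    {"user": "u3", "group": "at", "url": "https://example.org/a"},
]
usage = aggregate_clicks(events)
```

The resulting `usage` mapping describes how often content was used within each group, without any record of which individual clicked on what.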
We argue that these properties will make it possible to create an open, extensible, transparent and legally sound web search ecosystem, which leads to new business models and empowers end users with different models of web search.
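To make the baseline mentioned in Principle 5 concrete, the sketch below evaluates a robots.txt file with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `example-crawler`, and the URLs are made up for illustration. The point is that such rules only express allow or disallow per path and user agent; they say nothing about licensing or permitted usage.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content of a site (assumption for this sketch).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(robots_txt.splitlines())

# The only question robots.txt can answer: may this agent fetch this URL?
allowed = parser.can_fetch("example-crawler", "https://example.org/articles/1")
blocked = parser.can_fetch("example-crawler", "https://example.org/private/report")
```

In this sketch `allowed` is true and `blocked` is false. Anything more fine-grained, such as machine-readable licence terms, would need a mechanism beyond what this file format offers.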