The OpenWebSearch Engine Hub (OWSE-HUB)
To define downstream search engines using the OWI, we will also introduce the Open Web Search Engine Hub (OWSE-HUB). Similar to the OWI, the OWSE-HUB forms a web-based information system comparable to Docker Hub, but it will contain complete search engine stacks to enable the fast and easy creation of new search verticals. The architecture of the envisioned OWSE-Hub is outlined in the figure below.
Fig. 5 General architecture of the OWSE-Hub, and how the (federated) search strategy would declaratively capture the usage of indexes retrieved through specifications in the OWSE-Hub. Users can (1-3) pull search engine stacks, (4) build their own specifications for a (composite) search engine, and (5) push specifications to share with others.
Using the OWSE-Hub, users can declaratively define their own search configurations. Users can “pull” pre-defined specifications from the OWSE-Hub, use those to “build” their own custom search engines, and “push” the most useful ones to share them with others. This flexible setup allows for the creation of a wide variety of search engines, not only for commercial usage but also for personal and corporate search, and supports both centralized and federated search setups.
Modular Pipeline
The search engine specifications in the OWSE-Hub describe a modular pipeline in which results are retrieved from one or more search engines and then enriched or transformed by subsequent modules. The result list is modeled as a relational table or dataframe, where each row contains a single result with associated metadata (e.g. full text, URL, etc.). Each module applies a transformation on the input dataframe (such as re-ranking, summarization, etc.), and typically stores the result of the transformation in an additional column. An overview of such modular pipelines can be found below.
Fig. 6 A modular search pipeline that retrieves documents from a search engine, summarizes the full text of each document and then re-ranks the result set.
Our experimental search engine MOSAIC-RAG [HSNGurtl25] implements this modular pipeline and a base suite of processing modules. This suite is easily extensible to support additional modules, provided they can be implemented as transformations on dataframes.
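The dataframe-as-pipeline model above can be sketched in a few lines of Python. This is an illustrative toy, not the MOSAIC-RAG implementation: a list of dicts stands in for the dataframe, and the module and column names (`domain_module`, `run_pipeline`, `"domain"`) are made up for the example.

```python
from urllib.parse import urlparse

def domain_module(results, url_column="url", output_column="domain"):
    """Example module: derive each result's host name from its URL
    and store it in an additional column."""
    for row in results:
        row[output_column] = urlparse(row[url_column]).netloc
    return results

def run_pipeline(results, modules):
    """Apply a sequence of module transformations in order; each
    module takes the result table and returns a transformed table."""
    for module in modules:
        results = module(results)
    return results
```

Any function with this shape — table in, table out — slots into the pipeline, which is what makes the module suite easy to extend.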
Specifications
In order to save and exchange search engine specifications through the OWSE-Hub, we need to serialize the configuration of the modular pipeline. We have chosen to use a simple JSON-based specification format, in which we store the pipeline steps and other generic information of the search engine. Each module is uniquely defined by an ID, and can optionally be configured by a set of parameters. The resulting specification looks like the following:
{
  "pipeline": {
    "1": {
      "id": "module_1_id",
      "parameters": {
        "param_name": "param_value",
        ...
      }
    },
    ...
  },
  "title": "Example Search Engine",
  ...
}
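Assuming this key layout, a loader for such specifications can be sketched in a few lines of Python. Note that non-numeric keys at the pipeline level (such as the `query` field in the example further below) carry pipeline-wide settings and are skipped here; the function name and error handling are illustrative, not MOSAIC-RAG's API.

```python
import json

def load_pipeline_steps(spec_json):
    """Parse a specification and return the module steps in
    execution order as (module_id, parameters) pairs."""
    spec = json.loads(spec_json)
    pipeline = spec["pipeline"]
    # Steps are keyed by their (string) position in the pipeline;
    # sort numerically so "10" comes after "2".
    steps = sorted(
        ((int(k), v) for k, v in pipeline.items() if k.isdigit()),
        key=lambda kv: kv[0],
    )
    return [(v["id"], v.get("parameters", {})) for _, v in steps]
```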
For instance, if we configure MOSAIC-RAG to use a single MOSAIC index as its retrieval backend and then apply a document summarizer to the top 10 results returned by MOSAIC, the specification looks like this:
{
  "pipeline": {
    "query": "",
    "1": {
      "id": "mosaic_datasource",
      "parameters": {
        "output_column": "full-text",
        "url": "https://mosaic.ows.eu/service/api/",
        "limit": "10",
        "search_index": "simplewiki"
      }
    },
    "2": {
      "id": "llm_summarizer",
      "parameters": {
        "model": "gemma2",
        "input_column": "full-text",
        "output_column": "summary",
        "summarize_prompt": "Summarize: "
      }
    }
  }
}
MOSAIC-RAG can also save such configurations on its server under a unique ID. The search engine can then easily be loaded using a custom URL.
Existing Modules
We distinguish five categories of modules: data sources, pre-processing, re-ranking, summarisation, and metadata analysis. In this section, we briefly describe the modules that are currently supported by MOSAIC-RAG. For a more in-depth explanation of the different modules, we refer to the paper describing MOSAIC-RAG [HSNGurtl25].
Data Sources
The data source modules deal with retrieving search results from external search engines when a user issues a query. Currently, two search engines are supported: MOSAIC (sparse retrieval) and ChromaDB (dense retrieval). These search engines are accessed through an API, and the result lists are transformed into the relational model described above for further processing by downstream modules. Multiple data sources can be added to the same pipeline, allowing for the aggregation of data from different sources (e.g. multiple MOSAIC instances over different indexes).
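The normalisation and aggregation step can be sketched as follows. The field names in the raw API responses (`"url"`, `"text"`) are hypothetical, since each backend returns its own shape; the point is that every source is mapped onto the same row schema before the rows are merged.

```python
def to_rows(api_results, source_name):
    """Normalise one backend's result list into relational rows,
    tagging each row with its source."""
    return [
        {"url": r["url"], "full-text": r["text"], "source": source_name}
        for r in api_results
    ]

def aggregate(*row_lists):
    """Concatenate rows from several data sources into one result set."""
    merged = []
    for rows in row_lists:
        merged.extend(rows)
    return merged
```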
Pre-processing
The pre-processing modules mainly deal with text cleaning and organisation of the result set. There are modules to remove HTML tags and stop words, or to perform stemming operations on the text. These functions might not be needed in every case, as the search results may already be cleaned by the original search engine. The Reduction Module is especially useful when expensive modules are applied downstream: it reduces the size of the result set based on a condition (usually the position in the ranking), so that later modules have to process fewer documents.
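Two of these steps can be sketched as simple table transformations. The regex-based tag stripping is deliberately naive, and the function names and parameters are illustrative, not MOSAIC-RAG's actual module interface.

```python
import re

def strip_html(results, input_column="full-text"):
    """Pre-processing sketch: remove HTML tags from each result's text."""
    for row in results:
        row[input_column] = re.sub(r"<[^>]+>", "", row[input_column])
    return results

def reduce_by_rank(results, keep_top=10):
    """Reduction sketch: keep only the top-ranked rows so that
    expensive downstream modules process fewer documents."""
    return results[:keep_top]
```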
Re-ranking
The re-ranking modules change the ranking of the result set. Currently, four re-ranking modules are implemented by default in MOSAIC-RAG. The embedding re-ranker performs a new ranking based on the similarity of embedding vectors. A cheaper alternative re-ranks the results based on TF-IDF or BM25 similarity measures.
We also implemented two modules that apply LLMs for re-ranking: a group-style re-ranker that repeatedly asks the LLM to choose the most relevant document out of a specific subset of documents, and a tournament-style re-ranker that only refines the existing ranking and focuses on moving the most relevant documents to the top of the ranking.
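The "cheaper" lexical alternative can be illustrated with a toy TF-IDF scorer: each document is scored by the summed TF-IDF weight of the query terms, and the result set is sorted by that score. This is a minimal sketch, not MOSAIC-RAG's actual implementation.

```python
import math

def tfidf_rerank(results, query, text_column="full-text"):
    """Re-rank results by a simple smoothed TF-IDF score against
    the query terms (whitespace tokenisation, lowercased)."""
    docs = [row[text_column].lower().split() for row in results]
    n = len(docs)

    def idf(term):
        df = sum(1 for d in docs if term in d)
        return math.log((n + 1) / (df + 1)) + 1  # smoothed IDF

    terms = query.lower().split()

    def score(doc):
        if not doc:
            return 0.0
        return sum(doc.count(t) / len(doc) * idf(t) for t in terms)

    scored = sorted(zip(results, docs), key=lambda p: score(p[1]),
                    reverse=True)
    return [row for row, _ in scored]
```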
Summarisation
There are two types of summarisation modules. The first one summarises the full text of each web document in the result set, while the second one generates one summary of all the documents in the result set. Both modules apply an LLM with a targeted prompt for that specific task.
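The two modes differ only in where the LLM is applied. In the sketch below the model call is stubbed out as a plain callable `llm` so the flow is runnable; the function and parameter names are illustrative, and the prompts mirror the `summarize_prompt` parameter seen in the specification example above.

```python
def summarize_each(results, llm, input_column="full-text",
                   output_column="summary", prompt="Summarize: "):
    """Per-document mode: store one summary column per result."""
    for row in results:
        row[output_column] = llm(prompt + row[input_column])
    return results

def summarize_all(results, llm, input_column="full-text",
                  prompt="Summarize these documents:\n"):
    """Result-set mode: return a single summary over all documents."""
    joined = "\n\n".join(row[input_column] for row in results)
    return llm(prompt + joined)
```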
Metadata Analysis
Finally, there are currently three metadata analysis modules: one (basic) module simply counts the number of words in a document, another calculates sentiment scores, and the final module detects which parts of the full text are most relevant. Both the sentiment analysis and the relevance marking modules are implemented using LLMs.
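The basic word-count analysis fits the same module shape as everything else, which is worth seeing once: even trivial metadata is just another column-adding transformation. Names are illustrative.

```python
def word_count_module(results, input_column="full-text",
                      output_column="word_count"):
    """Metadata sketch: store each result's word count in a new column."""
    for row in results:
        row[output_column] = len(row[input_column].split())
    return results
```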