Knowledge Graph-Enhanced Earth Observation Applications#
Abstract#
We introduce two complementary applications designed to make scientific information in the Earth Observation (EO) domain more accessible and meaningful. Both tools support researchers, analysts, and data-driven decision makers in exploring and understanding EO data and knowledge.
The first application, RAG-based Model for EO Question Answering, allows users to ask complex, natural-language questions about Earth Observation and receive clear, grounded answers supported by scientific publications, datasets, and curated web sources.
The second application, Geo-Contextualized Multi-Genre Scientific Search, is an interactive web platform for exploring EO information spatially. Users can search by topic or location, visualize results on a map, and discover relationships between research papers, web content, and satellite datasets.
Together, these tools aim to simplify knowledge discovery and enhance trust in AI-assisted scientific exploration by connecting multiple information sources within an intuitive, research-oriented environment.
Use Case#
Key Functionality#
AI-Assisted Question Answering: Users can ask natural-language questions such as “How do wildfires affect land-surface temperature?” The system retrieves relevant scientific papers, datasets, and curated web content, then generates grounded answers.
Exploratory Search and Discovery: A visual interface enables users to search by topic or geographic region. Results are presented on interactive maps and linked lists, helping users explore relationships between datasets, publications, and EO missions.
Hybrid (Lexical and Semantic) Contextualization: Both systems retrieve information using a hybrid approach that combines lexical matching (BM25/Lucene) with semantic similarity. This enables precise keyword grounding while capturing conceptual relations across web and scientific sources within a unified contextual layer.
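To make the hybrid idea concrete, here is a minimal, self-contained sketch of one possible way to fuse a BM25 score with an embedding similarity. The fusion rule (min-max normalization followed by a weighted sum with `alpha`), the function names, and the toy two-dimensional vectors are assumptions made for illustration; the source does not state how the two signals are actually combined.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, docs_vecs, alpha=0.5):
    """Rank documents by alpha * lexical + (1 - alpha) * semantic (assumed rule)."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in docs_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined


# Toy corpus: tokens plus hypothetical 2-d embeddings.
docs_tokens = [
    ["wildfire", "land", "surface", "temperature"],
    ["sea", "ice", "extent"],
    ["burned", "area", "heat"],
]
docs_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
order, scores = hybrid_rank(["wildfire", "temperature"], [1.0, 0.0], docs_tokens, docs_vecs)
print(order)
```

In this toy corpus the third document shares no keyword with the query, yet it ranks above the second one because its embedding is close to the query embedding. That is the behavior a hybrid scheme is meant to provide.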
Context and Benefit#
These applications target Earth Observation scientists, data managers, and interdisciplinary researchers seeking trustworthy, explainable, and georeferenced access to EO knowledge. They bridge two common gaps:
Hallucination and context loss in LLMs for domain-specific QA.
Fragmentation of scientific data across heterogeneous repositories.
By connecting web data, scientific text, datasets, and spatial context, the systems improve trust, interpretability, and discovery efficiency in EO research.
Application#
Conceptual Description and Architecture#
Both applications are built on a common data reservoir that structures, enriches, and interlinks information from diverse Earth Observation (EO) sources. At their core, the systems use two complementary data pipelines: a web index for open EO content and a multi-genre knowledge graph connecting scientific datasets and publications.
The web index provides broad, easily digestible information related to EO. The knowledge graph, which consists of more than two million nodes, links resources from OpenAlex (peer-reviewed scientific publications), PANGAEA (datasets), and the DLR Geoservice (satellite data). Together, these components enable broad, exploratory access to EO information.
Below, we detail the conceptual design: first the data curation components shared by both systems, then the workflow and components of each system.
Data Curation and Semantic Integration#
The data curation workflow is organized into two streams:
Web Data Stream - Collects and indexes relevant EO-related webpages to support retrieval-augmented generation (RAG) and enrich open-domain content, using OWILIX and MOSAIC.
Scientific Data Stream - Processes publications and datasets, semantically linking them through NASA’s GCMD (Science Keywords) taxonomy, one of the most established classification schemes in Earth Observation.
By merging web-based and scientific content, the system ensures both breadth and precision in knowledge discovery, supporting users in answering domain questions and exploring EO data through spatial and semantic relations.
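As a rough illustration of semantic linking through a shared taxonomy, the sketch below connects resources of different genres whenever they carry the same concept label. The resource identifiers, the field names (`genre`, `concepts`), and the concept labels are illustrative assumptions, and the in-memory dictionaries stand in for the graph database used by the actual system.

```python
from collections import defaultdict
from itertools import combinations


def link_by_shared_concepts(resources):
    """Return edges between resources of different genres that share a concept.

    `resources` maps a resource id to a dict with a "genre" and a set of
    taxonomy "concepts" (hypothetical field names for this sketch).
    """
    # Invert the mapping: concept -> ids of resources tagged with it.
    by_concept = defaultdict(set)
    for rid, meta in resources.items():
        for concept in meta["concepts"]:
            by_concept[concept].add(rid)

    # Create one edge per cross-genre pair, labeled with the shared concepts.
    edges = defaultdict(set)
    for concept, rids in by_concept.items():
        for a, b in combinations(sorted(rids), 2):
            if resources[a]["genre"] != resources[b]["genre"]:
                edges[(a, b)].add(concept)
    return dict(edges)


# Illustrative records only; ids and labels are not taken from the real graph.
resources = {
    "pub:1": {"genre": "publication", "concepts": {"LAND SURFACE TEMPERATURE", "WILDFIRES"}},
    "data:7": {"genre": "dataset", "concepts": {"LAND SURFACE TEMPERATURE"}},
    "web:3": {"genre": "webpage", "concepts": {"SEA ICE"}},
}
edges = link_by_shared_concepts(resources)
print(edges)
```

Here the publication and the dataset become neighbors through their common concept, while the web page stays unconnected because it shares no concept with the other two.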
1. RAG-based Model for Earth Observation Question-Answering#
The workflow of the RAG-based model is as follows:
1. Data Curation (detailed above).
2. Generation: Two-step RAG, in which a zero-shot generation is followed by a refinement that uses context retrieved from the curated data of step 1.
3. Evaluation: LLM-based evaluation (“LLM-as-a-Judge”) for groundedness, helpfulness, and depth.
Technology Stack: LangGraph, LangChain, ChromaDB, ArangoDB.
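The two-step generation can be sketched as follows. This is a minimal outline under stated assumptions, not the project's implementation: `generate` and `retrieve` are hypothetical callables standing in for an LLM call and a retriever, the prompt wording is invented, and retrieving with the question alone (rather than with the draft) is an assumption.

```python
from typing import Callable, List


def two_step_rag(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    top_k: int = 3,
) -> dict:
    """Draft an answer zero-shot, then refine it with retrieved context."""
    # Step 1: zero-shot draft, produced without any retrieved context.
    draft = generate(f"Question: {question}\nAnswer:")

    # Step 2: fetch supporting passages and ask the model to revise the draft.
    passages = retrieve(question)[:top_k]
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    refine_prompt = (
        f"Question: {question}\n"
        f"Draft answer: {draft}\n"
        f"Context:\n{context}\n"
        "Revise the draft so that every claim is supported by the context."
    )
    final = generate(refine_prompt)
    return {"draft": draft, "context": passages, "answer": final}


# Stand-ins so the sketch runs without a model or an index.
def fake_generate(prompt: str) -> str:
    return "refined answer" if "Context:" in prompt else "draft answer"


def fake_retrieve(query: str) -> List[str]:
    return ["passage A", "passage B"]


result = two_step_rag(
    "How do wildfires affect land-surface temperature?", fake_generate, fake_retrieve
)
print(result["answer"])
```

Passing the model and the retriever in as plain callables keeps the control flow independent of any particular framework, so the same outline could be wired to whichever LLM client and vector store a deployment uses.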
2. Geo-Contextualized Multi-Genre Scientific Search#
Data Curation (detailed above)
Frontend: Vue.js interface with list and map views.
Backend: FastAPI (ASGI), spaCy (NER + geolocation), SentenceTransformers, ArangoDB Graph.
Visualization: Leaflet-based interactive maps displaying EO data footprints.
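Searching by location ultimately requires deciding which data footprints fall inside a region of interest. The sketch below shows one simple way to do that with axis-aligned bounding boxes in the `(min_lon, min_lat, max_lon, max_lat)` order also used by STAC and GeoJSON. The `Record` class, the example records, and the region coordinates are hypothetical, and the sketch deliberately ignores footprints that cross the antimeridian.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (min_lon, min_lat, max_lon, max_lat) in degrees.
BBox = Tuple[float, float, float, float]


@dataclass
class Record:
    title: str
    genre: str
    bbox: BBox


def intersects(a: BBox, b: BBox) -> bool:
    """True if two boxes overlap or touch; no antimeridian handling."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_by_region(records: List[Record], region: BBox) -> List[Record]:
    """Keep only the records whose footprint intersects the query region."""
    return [r for r in records if intersects(r.bbox, region)]


# Hypothetical records and query region, for illustration only.
records = [
    Record("Alpine snow cover", "dataset", (6.0, 44.0, 14.0, 48.0)),
    Record("Amazon deforestation", "publication", (-70.0, -10.0, -50.0, 0.0)),
]
region = (5.0, 45.0, 15.0, 55.0)
hits = filter_by_region(records, region)
print([r.title for r in hits])
```

A production system would more likely delegate this test to a spatial index in the database, but the overlap condition it evaluates is the same.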
Screenshot#
We show below a screenshot of our EO Question-Answering model:

We show below a screenshot of our EO Search Engine:

Index Data#
As mentioned earlier, the data curation step produces a web index built from the Open Web Search data and a knowledge graph interlinking several data sources. The table below summarizes all the data sources we use.
Data Sources#
| Source | Type | Purpose |
|---|---|---|
| Open Web Search (OWS) | Web documents | Blogs and grey literature for context |
| OpenAlex | Publications, metadata | Scientific abstracts and topics |
| PANGAEA | Datasets | Georeferenced Earth Science data |
| DLR EOC Geoservice | STAC Collections | Satellite data and EO missions |
| NASA GCMD | Taxonomy | Semantic linking and keyword extraction |
Compilation Process#
TaxoTagger matches texts to NASA GCMD concepts using embedding similarity. This tool is used to filter the Open Web Search (OWS) and the OpenAlex data to match the Earth Observation theme.
Corpus Annotation Graph (CAG) Builder populates the knowledge graph using ArangoDB.
The data are enriched with author, keyword, and mission metadata and then indexed for semantic and geospatial search.
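The filtering step based on embedding similarity can be illustrated with a small sketch. It does not use the TaxoTagger API. The function names, the threshold of 0.7, and the three-dimensional toy vectors are assumptions chosen only to show the principle: a document is kept when its embedding is close enough to at least one taxonomy concept.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def tag(text_vec, concept_vecs, threshold=0.7):
    """Return (concept, similarity) pairs at or above the threshold, best first."""
    scored = [(name, cosine(text_vec, vec)) for name, vec in concept_vecs.items()]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


def filter_corpus(doc_vecs, concept_vecs, threshold=0.7):
    """Keep only documents that match at least one concept."""
    kept = {}
    for doc_id, vec in doc_vecs.items():
        tags = tag(vec, concept_vecs, threshold)
        if tags:
            kept[doc_id] = tags
    return kept


# Hypothetical embeddings; a real setup would come from a sentence encoder.
concept_vecs = {"WILDFIRES": [1.0, 0.0, 0.0], "SEA ICE": [0.0, 1.0, 0.0]}
doc_vecs = {"fire-report": [0.9, 0.1, 0.0], "cooking-blog": [0.1, 0.1, 0.9]}
kept = filter_corpus(doc_vecs, concept_vecs)
print(list(kept))
```

The threshold controls the trade-off between keeping off-topic pages and discarding relevant ones, so in practice it would need to be tuned on labeled examples.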
Evaluation#
RAG-based Model for Earth Observation Question-Answering#
Phase I: Initial evaluation
70 EO questions, 5 quality criteria (relevance, groundedness, helpfulness, depth, factuality).
Models: Mistral-24B, Llama-3.3-70B.
Result: Two-step RAG outperformed both zero-shot and one-step RAG on all metrics (overall $\geq 4.9 / 5$).
Phase II: Repeat the Phase I evaluation with the Open Web Search data.
Phase III: Conduct a human evaluation of the two best-performing models from Phases I and II.
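For readers who want to reproduce this kind of comparison, the sketch below shows one plausible way to aggregate per-question judge ratings into per-criterion and overall means. The record layout, the system names, and the scores are placeholders invented for the example and are not results from the evaluation. Taking the overall score as the unweighted mean of the five criteria is likewise an assumption.

```python
from statistics import mean

CRITERIA = ("relevance", "groundedness", "helpfulness", "depth", "factuality")


def aggregate(ratings):
    """Average judge scores per system and criterion, plus an overall mean.

    `ratings` is a list of dicts holding a "system" name and one score
    (out of 5) per criterion; this layout is assumed for the sketch.
    """
    by_system = {}
    for row in ratings:
        by_system.setdefault(row["system"], []).append(row)

    summary = {}
    for system, rows in by_system.items():
        per_criterion = {c: mean(row[c] for row in rows) for c in CRITERIA}
        # Assumed definition: overall = unweighted mean over the criteria.
        per_criterion["overall"] = mean(per_criterion[c] for c in CRITERIA)
        summary[system] = per_criterion
    return summary


# Placeholder ratings for two questions and two hypothetical systems.
ratings = [
    {"system": "system_a", **{c: 4 for c in CRITERIA}},
    {"system": "system_a", **{c: 5 for c in CRITERIA}},
    {"system": "system_b", **{c: 3 for c in CRITERIA}},
]
summary = aggregate(ratings)
print(summary["system_a"]["overall"])
```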
Geo-Contextualized Multi-Genre Scientific Search#
Participants: 18 expert & non-expert users.
Findings:
High acceptance of multi-genre and geo-contextual integration (avg. $4.6 / 5$).
Visual map view improved exploration.
Users requested UI simplification for large result sets.
Sustainability and ELSA#
Ethical and Legal Aspects#
All data stem from open and FAIR sources (OpenAlex, PANGAEA, DLR Geoservice).
The systems respect data attribution, open-source licenses, and scientific transparency.
LLM-based evaluations use non-sensitive, domain-specific content only.
The goal is to enhance trustworthy AI and explainable search in science, aligning with ELSA principles.
Source Code and Installation#
RAG for Earth Observation: GitHub Repo
Search Engine Prototype: zenodo.org/records/13788713
Publications#
El Baff, R., Schluckebier, B., & Hecking, T. (2025). Knowledge Graph-Enhanced Retrieval-Augmented Generation for Earth Observation Data. DARES 2025, ECAI Workshop.
El Baff, R., & Hecking, T. (2025). Scientific Question Answering Using Hybrid Retrieval Augmented Generation. OSSYM 2025.
Honeder, J., El Baff, R., Hecking, T., et al. (2025). A Geo-Contextualized Multi-Genre Scientific Search Engine. ICGDA 2025.
Future and Outlook#
Continue the RAG-based model evaluation.
Integrate multimodal EO data (imagery, time series) within the RAG workflow.
Extend agentic search orchestration for autonomous data discovery.
Contact#
Roxanne El Baff
Ben Schluckebier
Tobias Hecking
E-mail:
<first>.<last_name>[at]dlr.de