Knowledge Graph-Enhanced Earth Observation Applications#

Abstract#

We introduce two complementary applications designed to make scientific information in the Earth Observation (EO) domain more accessible and meaningful. Both tools support researchers, analysts, and data-driven decision makers in exploring and understanding EO data and knowledge.

The first application, RAG-based Model for EO Question Answering, allows users to ask complex, natural-language questions about Earth Observation and receive clear, grounded answers supported by scientific publications, datasets, and curated web sources.

The second application, Geo-Contextualized Multi-Genre Scientific Search, is an interactive web platform for exploring EO information spatially. Users can search by topic or location, visualize results on a map, and discover relationships between research papers, web content, and satellite datasets.

Together, these tools aim to simplify knowledge discovery and enhance trust in AI-assisted scientific exploration by connecting multiple information sources within an intuitive, research-oriented environment.

Use Case#

Key Functionality#

  • AI-Assisted Question Answering: Users can ask natural-language questions such as “How do wildfires affect land-surface temperature?” The system retrieves relevant scientific papers, datasets, and curated web content, then generates grounded answers.

  • Exploratory Search and Discovery: A visual interface enables users to search by topic or geographic region. Results are presented on interactive maps and linked lists, helping users explore relationships between datasets, publications, and EO missions.

  • Hybrid (Lexical/Semantic) Contextualization: Both systems retrieve information using a hybrid approach that combines lexical matching (BM25/Lucene) with semantic similarity. This enables precise keyword grounding while capturing conceptual relations across web and scientific sources within a unified contextual layer.
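The hybrid retrieval idea can be sketched on a toy corpus: BM25 covers the lexical side, while a placeholder bag-of-words "embedding" stands in for the neural encoder on the semantic side. The corpus, the fusion weight `alpha`, and the normalization are illustrative assumptions, not the production configuration.

```python
import math
from collections import Counter

# Toy corpus standing in for indexed EO web and scientific documents.
DOCS = [
    "wildfires increase land surface temperature in burned areas",
    "sentinel satellites measure sea surface temperature daily",
    "soil moisture datasets support drought monitoring",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic BM25 lexical scoring (the role BM25/Lucene plays here)."""
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def embed(text):
    """Placeholder 'embedding': word counts. A real deployment would use a
    neural sentence encoder instead."""
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query, docs, alpha=0.5):
    """Fuse max-normalized lexical scores with semantic similarity."""
    lex = bm25_scores(query, docs)
    sem = [cosine(embed(query), embed(d)) for d in docs]
    max_lex = max(lex) or 1.0
    fused = [alpha * (l / max_lex) + (1 - alpha) * s for l, s in zip(lex, sem)]
    return sorted(range(len(docs)), key=lambda i: -fused[i])

ranking = hybrid_search("wildfires land surface temperature", DOCS)
```

The fusion weight trades keyword precision against conceptual recall; both signals agree here, so the wildfire document ranks first.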

Context and Benefit#

These applications target Earth Observation scientists, data managers, and interdisciplinary researchers seeking trustworthy, explainable, and georeferenced access to EO knowledge. They bridge two common gaps:

  1. Hallucination and context loss in LLMs for domain-specific QA.

  2. Fragmentation of scientific data across heterogeneous repositories.

By connecting web data, scientific text, datasets, and spatial context, the systems improve trust, interpretability, and discovery efficiency in EO research.

Application#

Conceptual Description and Architecture#

Both applications are built on a common data reservoir that structures, enriches, and interlinks information from diverse Earth Observation (EO) sources. At its core, this reservoir combines two complementary data pipelines: a web index for open EO content and a multi-genre knowledge graph connecting scientific datasets and publications.

The web index provides vast, digestible EO-related information, while the knowledge graph, comprising more than two million nodes, links resources from OpenAlex (peer-reviewed scientific publications), PANGAEA (datasets), and the DLR Geoservice (satellite data). Together, these components enable broad, explorative access to EO information.

Below we detail the conceptual design, first describing the data curation components shared by both systems, then the workflow and components of each system.

Data Curation and Semantic Integration#

The data curation workflow is organized into two streams:

  • Web Data Stream - Collects and indexes relevant EO-related webpages to support retrieval-augmented generation (RAG) and enrich open-domain content, using OWILIX and MOSAIC.

  • Scientific Data Stream - Processes publications and datasets, semantically linking them through NASA’s GCMD (Science Keywords) taxonomy, one of the most established classification schemes in Earth Observation.

By merging web-based and scientific content, the system ensures both breadth and precision in knowledge discovery—supporting users in answering domain questions and exploring EO data through spatial and semantic relations.
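The GCMD-based semantic linking in the scientific stream can be illustrated with a small sketch: candidate texts are matched to taxonomy concepts by similarity, and only sufficiently similar concepts are kept. The concept list, descriptions, and threshold below are toy assumptions; the real pipeline uses the full NASA GCMD taxonomy and neural embeddings rather than word counts.

```python
import math
from collections import Counter

# Illustrative subset of GCMD-style science keywords (not the full taxonomy).
GCMD = {
    "EARTH SCIENCE > BIOSPHERE > ECOLOGICAL DYNAMICS > FIRE ECOLOGY":
        "fire wildfire burned area ecology",
    "EARTH SCIENCE > LAND SURFACE > SURFACE THERMAL PROPERTIES":
        "land surface temperature thermal emissivity",
    "EARTH SCIENCE > OCEANS > OCEAN TEMPERATURE":
        "sea surface temperature ocean heat",
}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tag(text, concepts, threshold=0.2):
    """Return GCMD concepts whose description is similar enough to the text.
    Word-count vectors stand in for the embeddings a real tagger would use."""
    v = Counter(text.lower().split())
    hits = []
    for concept, desc in concepts.items():
        sim = cosine(v, Counter(desc.split()))
        if sim >= threshold:
            hits.append((concept, round(sim, 2)))
    return sorted(hits, key=lambda x: -x[1])

tags = tag("Wildfire impact on land surface temperature", GCMD)
```

Texts that match no concept above the threshold would be dropped as off-theme, which is how the streams stay focused on Earth Observation.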

Data Pipelines.

1. RAG-based Model for Earth Observation Question-Answering#

  • The workflow of the RAG-Based model is as follows:

    1. Data Curation (detailed above)

    2. Generation. Two-step RAG: zero-shot generation $\rightarrow$ refinement using context retrieved in step 1.

    3. Evaluation. LLM-based evaluation (“LLM-as-a-Judge”) for groundedness, helpfulness, and depth.

  • Technology Stack: LangGraph, LangChain, ChromaDB, ArangoDB.
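The two-step generation can be sketched as a plain control flow; the stub `llm` and `retrieve` functions below are placeholders for the hosted model and the ChromaDB/ArangoDB retrieval layer, and their canned outputs exist only so the flow is runnable.

```python
def llm(prompt: str) -> str:
    """Stub for a hosted model (e.g. Mistral-24B). Returns canned text so
    the two-step flow runs without a model server."""
    if "Context:" in prompt:
        return "Refined: wildfires raise land-surface temperature [source: retrieved context]."
    return "Draft: wildfires change surface properties."

def retrieve(question: str) -> list[str]:
    """Stub retriever; the real system queries the web index and the
    knowledge graph built during data curation."""
    return ["Burned areas show elevated land-surface temperature."]

def two_step_rag(question: str) -> str:
    # Step 1: zero-shot generation, no external context.
    draft = llm(f"Answer the question: {question}")
    # Step 2: refinement grounded in retrieved evidence.
    context = "\n".join(retrieve(question))
    return llm(
        f"Question: {question}\nDraft: {draft}\n"
        f"Context:\n{context}\n"
        "Revise the draft so every claim is supported by the context."
    )

answer = two_step_rag("How do wildfires affect land-surface temperature?")
```

The design choice is that the zero-shot draft fixes the answer's scope, while the second pass anchors each claim to retrieved evidence, which is what the groundedness evaluation later checks.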

Screenshot#

  • We show below a screenshot of our EO Question-Answering model:

RAG-based LLM

  • We show below a screenshot of our EO Search Engine:

Scientific Search UI

Index Data#

As mentioned earlier, the data curation step builds on a web index derived from Open Web Search data and a knowledge graph interlinking several data sources. The table below summarizes all the data sources we use.

Data Sources#

| Source | Type | Purpose |
| --- | --- | --- |
| Open Web Search (OWS) | Web documents | Blogs and grey literature for context |
| OpenAlex | Publications, metadata | Scientific abstracts and topics |
| PANGAEA | Datasets | Georeferenced Earth Science data |
| DLR EOC Geoservice | STAC Collections | Satellite data and EO missions |
| NASA GCMD | Taxonomy | Semantic linking and keyword extraction |

Compilation Process#

  • TaxoTagger matches texts to NASA GCMD concepts using embedding similarity. It filters the Open Web Search (OWS) and OpenAlex data to match the Earth Observation theme.

  • Corpus Annotation Graph (CAG) Builder populates the knowledge graph using ArangoDB.

  • Data are enriched with author, keyword, and mission metadata, then indexed for semantic and geospatial search.
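The graph-population step can be illustrated with an in-memory stand-in for the ArangoDB-backed graph: nodes carry GCMD keywords, and edges link publications to datasets that share a keyword. The node IDs (`openalex:W1`, `pangaea:D1`, …), titles, and keywords are invented for illustration and are not real records.

```python
# Toy node sets standing in for the knowledge-graph collections.
publications = {
    "openalex:W1": {"title": "Wildfire effects on LST",
                    "gcmd": {"FIRE ECOLOGY", "SURFACE THERMAL PROPERTIES"}},
    "openalex:W2": {"title": "Ocean heat content trends",
                    "gcmd": {"OCEAN TEMPERATURE"}},
}
datasets = {
    "pangaea:D1": {"title": "Burned-area LST time series",
                   "gcmd": {"SURFACE THERMAL PROPERTIES"}},
    "pangaea:D2": {"title": "Float temperature profiles",
                   "gcmd": {"OCEAN TEMPERATURE"}},
}

def build_edges(pubs, dsets):
    """Create 'shares_keyword' edges between publications and datasets,
    the kind of interlinking the CAG Builder performs when populating
    the graph database."""
    edges = []
    for pid, pub in pubs.items():
        for did, ds in dsets.items():
            shared = pub["gcmd"] & ds["gcmd"]
            if shared:
                edges.append({"from": pid, "to": did,
                              "keywords": sorted(shared)})
    return edges

edges = build_edges(publications, datasets)
```

In the deployed system these nodes and edges live in ArangoDB collections, which is what enables the search interface to traverse from a paper to related datasets and missions.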

Evaluation#

RAG-based Model for Earth Observation Question-Answering#

  • Phase I - Initial evaluation:

    • 70 EO questions, 5 quality criteria (relevance, groundedness, helpfulness, depth, factuality).

    • Models: Mistral-24B, Llama-3.3-70B.

    • Result: Two-step RAG outperformed both zero-shot and one-step RAG on all metrics (overall $\geq 4.9 / 5$).

  • Phase II - Repeat the evaluation above with the Open Web Search data.

  • Phase III - Conduct a human evaluation of the top two models from Phases I and II.
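The LLM-as-a-Judge protocol can be sketched as a rubric loop over the five criteria; the stub `judge` below always returns 5 so the loop is runnable, whereas a real run would send each rubric prompt to a judge model such as Llama-3.3-70B, and the prompt wording here is illustrative.

```python
CRITERIA = ["relevance", "groundedness", "helpfulness", "depth", "factuality"]

def judge_prompt(question: str, answer: str, criterion: str) -> str:
    """Build one rubric prompt per criterion (wording is an assumption)."""
    return (f"Rate the answer to '{question}' for {criterion} "
            f"on a 1-5 scale.\nAnswer: {answer}\n"
            "Respond with a single integer.")

def judge(prompt: str) -> int:
    """Stub judge model; a deployed evaluator would call an LLM here."""
    return 5

def evaluate(question: str, answer: str) -> dict:
    """Score one answer on every criterion and average into an overall score."""
    scores = {c: judge(judge_prompt(question, answer, c)) for c in CRITERIA}
    scores["overall"] = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    return scores

scores = evaluate("How do wildfires affect land-surface temperature?",
                  "They raise it in burned areas.")
```

Averaging per-criterion integer scores is how a single "overall $\geq 4.9/5$"-style figure can be reported per model and per RAG variant.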

Geo-Contextualized Multi-Genre Scientific Search#

  • Participants: 18 expert & non-expert users.

  • Findings:

    • High acceptance of multi-genre and geo-contextual integration (avg. $4.6 / 5$).

    • Visual map view improved exploration.

    • Users requested UI simplification for large result sets.

Sustainability and ELSA#

Source Code and Installation#

Publications#

Future and Outlook#

  • Continue the RAG-based model evaluation.

  • Integrate multimodal EO data (imagery, time series) within the RAG workflow.

  • Extend agentic search orchestration for autonomous data discovery.

Contact#

  • Roxanne El Baff

  • Ben Schluckebier

  • Tobias Hecking

E-mail: <first_name>.<last_name>[at]dlr.de