Welcome
- Sept. 3, 2025, 13:00 – 13:30 Oral Session
Artificial Intelligence/ Machine Learning methods in Earth System Sciences
- Sept. 3, 2025, 13:30 – 15:15 Oral Session
Climate research often requires substantial technical expertise: managing data standards, various file formats, software engineering, and high-performance computing. Translating scientific questions into code that can answer them demands significant effort. The question is, why? Data analysis platforms like Freva (Kadow et al. 2021, e.g., gems.dkrz.de) aim to enhance user convenience, yet programming expertise is still required. In this context, we introduce a large language model setup with a chatbot interface for different model cores, e.g. based on GPT-4/ChatGPT or DeepSeek, which enables climate analysis without technical obstacles, including language barriers. We are not yet dealing with dedicated climate LLMs for this purpose; dedicated natural language processing methodologies could take this to the next level. The approach is tailored to the needs of the broader climate community, which ranges from small, fast analyses to massive data sets from kilometer-scale modeling, requires a processing environment built on modern technologies, and ultimately still addresses society, such as in the Earth Virtualization Engines (EVE - eve4climate.org). Our interface runs on a High Performance Computer with access to petabytes of data - everything just a chat away.
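For illustration, the following minimal sketch shows one way such a chat front end could route a natural-language request to an existing analysis plugin on the HPC system. All names here (query_llm, run_analysis_plugin, the JSON contract) are hypothetical placeholders, not the actual Freva or chatbot API.

```python
# Sketch: mapping a chat message onto an analysis plugin call near the data.
# query_llm and run_analysis_plugin are placeholders for the real backends.
import json

def query_llm(prompt: str) -> str:
    """Placeholder for the configured model core (GPT-4/ChatGPT, DeepSeek, ...).
    Expected to return a JSON string naming a plugin and its parameters."""
    raise NotImplementedError("wire up the chosen LLM backend here")

def run_analysis_plugin(name: str, **params):
    """Placeholder for dispatching the job to the analysis platform on the HPC system."""
    raise NotImplementedError("call the platform's plugin API here")

def handle_chat_message(message: str):
    # Ask the LLM to translate the question into a known plugin and its parameters.
    plan = json.loads(query_llm(
        "Map this request to one analysis plugin, answering as JSON with "
        f"keys 'plugin' and 'params': {message}"
    ))
    # Execute the selected plugin close to the PB-scale data archive.
    return run_analysis_plugin(plan["plugin"], **plan["params"])
```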
The protection of critical underwater infrastructure, such as pipelines, data cables, or offshore energy assets, has become an emerging security challenge. Despite its growing importance, maritime infrastructure monitoring remains limited by high costs, insufficient coverage, and fragmented data processing workflows. The ARGUS project addresses these challenges by developing an AI-driven platform to support risk assessment and surveillance at sea.
At its core, ARGUS integrates satellite-based Synthetic Aperture Radar (SAR) imagery, AIS vessel tracking data, and spatial information on critical assets into a unified data management system. A key functionality is detecting so-called "ghost ships" – vessels that deliberately switch off their AIS transponders – using object detection techniques on SAR imagery.
At the same time, we are currently developing methods for underwater anomaly and change detection based on optical imagery. This work is still ongoing and focuses on identifying relevant structural or environmental changes in submerged infrastructure through automated image comparison and temporal analysis.
In this talk, we present the architecture and workflows of the ARGUS system, including our use of deep learning (YOLO-based object detection) in the maritime context. We share insights into the current capabilities and limitations of AI models for maritime surveillance, especially in the context of …
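As a hedged illustration of the ghost-ship idea, the sketch below runs a YOLO detector over a SAR scene and flags detections that have no AIS report nearby; the weights file, image path, preprocessed AIS pixel positions, and distance threshold are assumptions for illustration, not the actual ARGUS pipeline.

```python
# Sketch: detect vessels in a SAR scene and flag those without a nearby AIS report.
# Paths, weights, and the matching threshold are illustrative assumptions.
import numpy as np
from ultralytics import YOLO

model = YOLO("sar_vessels.pt")                 # hypothetical fine-tuned weights
results = model.predict("sar_scene.png", conf=0.25)

# AIS positions already projected into the image's pixel coordinates (assumed preprocessing).
ais_pixels = np.array([[1204.0, 873.0], [455.0, 2310.0]])

ghost_candidates = []
for box in results[0].boxes.xywh.cpu().numpy():
    cx, cy = box[0], box[1]
    # A detection with no AIS fix within ~50 px is a "ghost ship" candidate.
    if ais_pixels.size == 0 or np.min(np.hypot(ais_pixels[:, 0] - cx,
                                               ais_pixels[:, 1] - cy)) > 50:
        ghost_candidates.append((float(cx), float(cy)))

print(f"{len(ghost_candidates)} ghost-ship candidates:", ghost_candidates)
```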
Current AI cannot function without data, yet this precious resource is often underappreciated. In the context of machine learning, dealing with incomplete datasets is a widespread challenge. Large, consistent, and error-free data sets are essential for an optimally trained neural network. Complete and well-structured inputs substantially contribute to training, results, and subsequent conclusions. As a result, using high-quality data improves the performance and generalization ability of neural networks.
However, real-world datasets from field measurements can suffer from information loss. Sensor failures, maintenance issues, or inconsistent data collection can cause invalid ('NaN', Not a Number) values to appear in the neural network input matrices.
Imputation techniques are an important step in data processing for handling missing values. Estimating 'NaN' values or replacing them with plausible values directly affects the quality of the input data and thus the effectiveness of the neural network.
In this contribution, we present a neural network-based regression model (ANN regression) that explains the salt characteristics in the Elbe estuary. In this context, we focus on selecting appropriate imputation strategies.
While traditional methods such as imputation by mean, median, or mode are simple and computationally efficient, they sometimes fail to preserve the underlying data …
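As a minimal illustration of the kind of comparison involved (not the study's actual pipeline or data), the sketch below fills NaN gaps in a synthetic feature matrix with mean imputation and with KNN imputation, which can preserve more of the cross-feature structure.

```python
# Sketch: compare two imputation strategies on a toy feature matrix with NaN gaps.
# The data are synthetic; the study's real inputs and model are not reproduced here.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan      # knock out ~10% of the values

X_mean = SimpleImputer(strategy="mean").fit_transform(X)   # simple, fast
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)         # uses neighbouring samples

print("remaining NaNs:", np.isnan(X_mean).sum(), np.isnan(X_knn).sum())
```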
We present a comprehensive machine learning framework for predicting spatially distributed geographical data from point measurements. The framework takes as input a set of geographical features at a specified grid resolution (e.g., 5 arc-minute scale) and corresponding point measurements with their spatial coordinates and target values. The framework trains and evaluates multiple machine learning models, including both tree-based methods (Random Forest, XGBoost, CatBoost) and deep learning architectures (feed-forward neural networks, TabPFN[1]), to identify the optimal predictive model for the given dataset.
The framework incorporates hyperparameter search (depth and width) for deep learning models and systematic parameter search for tree-based models (e.g., number of estimators). This ensures robust model selection and performance optimization across different geographical contexts and data characteristics. The framework outputs the best-performing model along with comprehensive performance metrics and uncertainty estimates.
As a non-trivial application, we demonstrate the framework's effectiveness in predicting total organic carbon (TOC) concentrations[2] and sedimentation rates in the ocean. This involves integrating features from both the sea surface and seafloor, encompassing a diverse array of oceanographic, geological, geographic, biological, and biogeochemical parameters. The framework successfully identifies the most suitable model architecture and hyperparameters for this complex spatial prediction task, providing both high accuracy and …
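For illustration, the sketch below shows model selection of the kind described, comparing two tree-based candidates with a small parameter search on synthetic stand-in data; the real framework's features, grids, deep learning candidates, and metrics are not reproduced here.

```python
# Sketch: choosing between candidate models for point-to-grid prediction.
# Synthetic stand-in data; not the framework's actual features or model zoo.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))          # gridded features sampled at the measurement points
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

candidates = {
    "random_forest": GridSearchCV(RandomForestRegressor(random_state=0),
                                  {"n_estimators": [200, 500], "max_depth": [None, 10]}, cv=5),
    "xgboost": GridSearchCV(XGBRegressor(random_state=0),
                            {"n_estimators": [200, 500], "max_depth": [4, 8]}, cv=5),
}

scores = {}
for name, search in candidates.items():
    search.fit(X, y)                                    # systematic parameter search
    scores[name] = cross_val_score(search.best_estimator_, X, y, cv=5).mean()

best = max(scores, key=scores.get)
print("selected model:", best, "| CV R^2:", round(scores[best], 3))
```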
Urban-scale air quality data is crucial for exposure assessment and decision-making in cities. However, high-resolution Eulerian Chemistry Transport Models (CTMs) with street-scale resolutions (100 m x 100 m), while process-based and scenario-capable, are computationally expensive and require city-specific emission inventories, meteorological fields and boundary concentrations. In contrast, machine learning (ML) offers a scalable and efficient alternative to enhance spatial resolution using existing regional-scale (1 km - 10 km grid resolutions) reanalysis datasets.
We present a reproducible ML framework that downscales hourly NO₂ data from the CAMS Europe ensemble (~10 km resolution) to 100 × 100 m² resolution, using 11 years of data (2013–2023) for Hamburg. The framework integrates satellite-based and modelled inputs (CAMS, ERA5-Land), spatial predictors (CORINE, GHSL, OSM), and time indicators. Two ML approaches are employed: XGBoost for robust prediction and interpretability (via SHAP values), and Gaussian Processes for quantifying spatial and temporal uncertainty.
The downscaling is evaluated through random, time-based and leave-site-out validation approaches. Results demonstrate good reproduction of observed spatial and temporal NO₂ patterns, including traffic peaks and diurnal/seasonal trends. The trained models generate over 160 million hourly predictions for Hamburg with associated uncertainty fields. Although developed for Hamburg, the framework has been successfully …
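As a hedged sketch of the XGBoost-plus-SHAP component, the snippet below fits a gradient-boosted model on synthetic stand-in features and attributes its predictions with SHAP; the feature names and data are assumptions, and the real predictors (CAMS, ERA5-Land, CORINE, GHSL, OSM, time indicators) and 100 m target grid are not reproduced here.

```python
# Sketch: XGBoost downscaling with SHAP attribution on synthetic stand-in features.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(2)
X = pd.DataFrame({
    "cams_no2": rng.gamma(2.0, 10.0, 2000),     # coarse-resolution NO2 field (assumed)
    "road_density": rng.random(2000),           # spatial predictor (assumed)
    "hour_of_day": rng.integers(0, 24, 2000),   # time indicator (assumed)
})
y = 0.8 * X["cams_no2"] + 15 * X["road_density"] + rng.normal(0, 2, 2000)

model = XGBRegressor(n_estimators=300, max_depth=6).fit(X, y)

# SHAP values indicate how each predictor shifts the local NO2 estimate.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0))
```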
This study presents an end-to-end deep learning framework, 4DVarNet, for reconstructing high-resolution spatiotemporal fields of suspended particulate matter (SPM) in the German Bight under realistic satellite data gaps. Using a two-phase approach, the network is first pretrained on gap-free numerical model outputs masked with synthetic cloud patterns, then fine-tuned against sparse CMEMS observations with an additional independent validation mask. The architecture embeds a trainable dynamical prior and a convolutional LSTM solver to iteratively minimize a cost function that balances data agreement with physical consistency. The framework is applied to one year (2020) of real observations (CMEMS) and co-located model simulations, demonstrating robust performance under operational conditions. Reconstructions capture major spatial patterns with correlation R² = 0.977 and 50% of errors within ±0.2 mg/L, even when 27% of days lack any observations. Sensitivity experiments reveal that removing 60% of the available data doubles the RMSE and smooths fine-scale SPM spatial features. Moreover, increasing the assimilation window reduces edge discontinuities between the data-void area and the adjacent data-rich region, but degrades sub-daily variability. Extending 4DVarNet to higher temporal resolution (hourly) reconstruction will require incorporating tidal dynamics to account for SPM resuspension, enabling real-time sediment transport forecasting in coastal environments.
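The sketch below illustrates, in highly simplified form, the kind of cost the solver iterates on: an observation term over observed pixels plus a penalty tying the state to a learned prior. The tensor shapes, the stand-in prior network, and the plain gradient loop are illustrative assumptions, not the published 4DVarNet implementation.

```python
# Sketch: a simplified variational update balancing data agreement and a learned prior.
# Shapes, the prior network, and the optimizer are illustrative; not the actual 4DVarNet code.
import torch

T, H, W = 10, 64, 64                       # days, grid height, grid width
obs = torch.randn(T, H, W)                 # synthetic SPM observations
mask = torch.rand(T, H, W) > 0.5           # True where a satellite pixel exists

prior = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)   # stand-in for the trainable dynamical prior
state = obs.clone().masked_fill(~mask, 0.0).requires_grad_(True)

opt = torch.optim.Adam([state], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    data_term = ((state - obs)[mask] ** 2).mean()                                   # agreement with observations
    prior_term = ((state - prior(state.unsqueeze(1)).squeeze(1)) ** 2).mean()       # physical-consistency proxy
    (data_term + 0.1 * prior_term).backward()
    opt.step()
```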
Data Management Workflows
- Sept. 3, 2025, 16:30 – 18:15 Oral Session
The Helmholtz Model Zoo (HMZ) is a cloud-based platform that provides remote access to deep learning models within the Helmholtz Association. It enables seamless inference execution via both a web interface and a REST API, lowering the barrier for scientists to integrate state-of-the-art AI models into their research.
Scientists from all 18 Helmholtz centers can contribute their models to HMZ through a streamlined, well-documented submission process on GitLab. This process minimizes effort for model providers while ensuring flexibility for diverse scientific use cases. Based on the information provided about the model, HMZ automatically generates the web interface and API, tests the model, and deploys it. The REST API further allows for easy integration of HMZ models into other computational pipelines.
With the launch of HMZ, researchers can now run AI models within the Helmholtz Cloud while keeping their data within the association. The platform imposes no strict limits on the number of inferences or the volume of uploaded data, and it supports both open-access and restricted-access model sharing. Data uploaded for inference is stored within HIFIS dCache InfiniteSpace and remains under the ownership of the uploading user.
HMZ is powered by GPU nodes equipped with four NVIDIA L40 GPUs per …
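To illustrate the kind of integration the REST API enables, the snippet below posts an input file to a model endpoint and polls for the result; the URL, routes, token handling, and response fields are hypothetical placeholders, not the documented HMZ API.

```python
# Sketch: calling a hosted model over a REST API. Endpoint, routes, and response
# fields are hypothetical placeholders, not the documented HMZ interface.
import time
import requests

BASE = "https://modelzoo.example.org/api"          # placeholder URL
headers = {"Authorization": "Bearer <token>"}      # placeholder auth

with open("input.nc", "rb") as fh:
    job = requests.post(f"{BASE}/models/my-model/inference",
                        headers=headers, files={"file": fh}).json()

while True:
    status = requests.get(f"{BASE}/jobs/{job['id']}", headers=headers).json()
    if status["state"] in ("finished", "failed"):
        break
    time.sleep(10)

print("job finished with state:", status["state"])
```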
In 2022, GEOMAR created the Data Science Unit as its internal start-up to centralize data science support and activities. With up to eight data scientists supporting GEOMAR, the unit addressed various projects and services in the following years. Now, three years after its foundation, we present lessons learned, such as the importance of on-site training programs, the challenge of balancing generalisation and customization, and the varied success in achieving science-based key performance indicators.
FAIR-compliant long-term archiving of marine seismic data acquired from active-source surveys remains a critical yet complex task within the geophysical data life cycle. Data infrastructures such as PANGAEA – Data Publisher for Earth & Environmental Science and affiliated repositories must address the increasing volume, heterogeneity, and complexity of these datasets, which are produced using a variety of acquisition systems. To support this, the German marine seismic community is actively developing metadata standards tailored to different seismic data types, enabling their proper integration and archiving in PANGAEA. In parallel, new semi-automated workflows and standard operating procedures (SOPs) are being established and implemented to ensure consistent data publication and sustainable long-term stewardship.
These advancements are being driven by the “Underway” Research Data project, a cross-institutional initiative of the German Marine Research Alliance (Deutsche Allianz Meeresforschung e.V., DAM). Initiated in mid-2019, the project aims to standardize and streamline the continuous data flow from German research vessels to open-access repositories, in alignment with FAIR data management practices. Marine seismic data curation, in particular, stands out as a successful use case for integrating expedition-based data workflows. By leveraging the tools, infrastructure, and expertise provided by the “Underway” Research Data …
Autonomous Underwater Vehicles (AUVs) and Remotely Operated Vehicles (ROVs) are essential tools for investigating marine environments. These large-scale platforms are equipped with a variety of sensors and systems, including CTD, fluorometers, multibeam echosounders, side-scan sonar, and camera systems. ROVs also have the capability to collect water, biological, and geological samples. As a result, the datasets acquired from these missions are highly heterogeneous, combining diverse data types that require careful handling, standardization of metadata information, and publication.
At GEOMAR, within the context of the Helmholtz DataHub, we develop and implement a comprehensive workflow that spans the entire data lifecycle for large-scale facilities.
It combines the O2A Registry for device management, the Ocean Science Information System (OSIS) for cruise information, PANGAEA for data publication, and the portal earth-data.de for future visualization of AUV and ROV missions.
The presented workflow is currently deployed for GEOMAR’s REMUS6000 AUV "Abyss", and is being designed with scalability in mind, enabling its future application to other AUVs and ROVs.
The German research vessels Alkor, Elisabeth Mann Borgese, Heincke, Maria S. Merian, Meteor, Polarstern and Sonne steadily provide oceanographic, meteorological and other data to the scientific community. However, accessing and integrating time series raw data from these platforms has traditionally been fragmented and technically challenging. The newly deployed DSHIP Land System addresses this issue by consolidating time series data from marine research vessels into a unified and scalable data warehouse.
At its core, the new system stores raw measurement data in the efficient and open Apache Parquet format. These columnar storage files allow for rapid querying and filtering of large datasets. To ensure flexible and high-performance access, the system uses a Trino SQL query engine running on a Kubernetes cluster composed of three virtual machines. This setup can be elastically scaled to meet variable demand, enabling efficient data access even under high load.
This talk will briefly introduce the technical foundations of the DSHIP Land System, highlighting the choice of storage format, the architecture of the Trino engine, and its deployment in a containerized Kubernetes environment. The focus will then shift to a demonstration of how users can interactively query the datasets using standard SQL, enabling cross-vessel data exploration, filtering by …
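As an illustration of the kind of access this enables, the snippet below queries the warehouse through the Trino Python client; the host, catalog, schema, table, and column names are hypothetical placeholders rather than the actual DSHIP Land System configuration.

```python
# Sketch: cross-vessel SQL access via Trino. Host, catalog, schema, table, and
# column names are placeholders, not the actual DSHIP Land System configuration.
import trino

conn = trino.dbapi.connect(
    host="dship.example.org", port=443, user="scientist",
    http_scheme="https", catalog="dship", schema="timeseries",
)
cur = conn.cursor()
cur.execute("""
    SELECT vessel, date_trunc('hour', ts) AS hour, avg(water_temp) AS mean_temp
    FROM measurements
    WHERE ts BETWEEN TIMESTAMP '2024-06-01 00:00:00' AND TIMESTAMP '2024-06-30 23:59:59'
      AND vessel IN ('Sonne', 'Meteor')
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
for vessel, hour, mean_temp in cur.fetchall():
    print(vessel, hour, round(mean_temp, 2))
```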
The Baltic Sea is a semi-enclosed shelf sea characterized by distinct geographical and oceanographic features. One of the Baltic's most remarkable features is its surface salinity gradient, which decreases horizontally from the saline North Sea to the nearly fresh Bothnian Sea in the north and the Gulf of Finland in the east. Additionally, a vertical gradient and strong stratification separate the less saline surface water from the saline deep water. These salinity features are mainly driven by a combination of river runoff, net precipitation, wind conditions, and geographic features that lead to restricted and irregular inflow of saltwater into the Baltic and limited mixing. The overall positive freshwater balance makes the Baltic much fresher than fully marine ocean waters, with a mean salinity of only about 7 g/kg. The Baltic Sea is particularly sensitive to climate change and global warming due to its shallowness, small volume and limited exchange with the world oceans. Consequently, it is changing more rapidly than other regions. Recent changes in salinity are less clear due to high variability, but overall surface salinity seems to be decreasing, with a simultaneous increase in the deeper water layers. Furthermore, the overall salinity distribution is …
The growing complexity of digital research environments and the explosive increase in data volume demand robust, interoperable infrastructures to support sustainable Research Data Management (RDM). In this context, data spaces have emerged—especially in industry—as a powerful conceptual framework for organizing and sharing data across ecosystems, institutional boundaries, and disciplines. Although the term is not yet fully established in the research community, it maps naturally onto scientific practice, where the integration of heterogeneous datasets and cross-disciplinary collaboration are increasingly central.
Aligned with the principles of open science, FAIR Digital Objects (FDOs) provide a promising infrastructure for structuring these emerging data spaces. FDOs are standardized, autonomous, and machine-actionable digital entities that encapsulate data, metadata, software, and semantic assertions. They enable both humans and machines to Find, Access, Interoperate, and Reuse (FAIR) digital resources efficiently. By abstracting from underlying technologies and embedding persistent, typed relations, FDOs allow for seamless data integration, provenance tracking, and rights management across domains. This structure promotes reproducibility, trust, and long-term sustainability in data sharing.
Using an example from climate research, we demonstrate how data from different data spaces can be combined. By employing STACs (SpatioTemporal Asset Catalogs) defined as FAIR Digital Objects facilitating the European Open …
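As a hedged illustration of combining data via STAC, the snippet below searches a public STAC API for items in a region and time range; the catalog URL and collection name are generic examples, not the specific data spaces used in the presented work.

```python
# Sketch: discovering assets across data spaces through a STAC API.
# The endpoint and collection are generic examples, not the catalogs used in this work.
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[9.5, 53.3, 10.4, 53.8],            # example region (Hamburg)
    datetime="2023-06-01/2023-06-30",
    max_items=5,
)
for item in search.items():
    print(item.id, item.datetime, list(item.assets))
```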
Metadata
- Sept. 4, 2025, 09:00 – 10:15 Oral Session
To ensure FAIR data (Wilkinson et al., 2016: https://doi.org/10.1038/sdata.2016.18 ), well-described datasets with rich metadata are essential for interoperability and reusability. In Earth System Science, NetCDF is the quasi-standard for storing multidimensional data, supported by metadata conventions such as Climate and Forecast (CF, https://cfconventions.org/ ) and Attribute Convention for Data Discovery (ACDD, https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3 ).
While NetCDF can be self-describing, metadata often lacks compatibility and completeness for repositories and data portals. The Helmholtz Metadata Guideline for NetCDF (HMG NetCDF) Initiative addresses these issues by establishing a standardized NetCDF workflow. This ensures seamless metadata integration into downstream processes and enhances AI-readiness.
A consistent metadata schema benefits the entire processing chain. We demonstrate this by integrating enhanced NetCDF profiles into selected clients like the Earth Data Portal (EDP, https://earth-data.de ). Standardized metadata practices facilitate repositories such as PANGAEA ( https://www.pangaea.de/ ) and WDCC ( https://www.wdc-climate.de ), ensuring compliance with established norms.
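For illustration only, the snippet below writes a small dataset with a handful of CF/ACDD-style attributes using xarray; the attribute values are examples, and the HMG NetCDF guideline defines the authoritative set of fields.

```python
# Sketch: attaching CF/ACDD-style attributes to a NetCDF file with xarray.
# Attribute values are illustrative; the HMG NetCDF guideline defines the required fields.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"air_temperature": (("time", "lat", "lon"), np.random.rand(4, 3, 3))},
    coords={"time": np.arange(4), "lat": [53.0, 54.0, 55.0], "lon": [8.0, 9.0, 10.0]},
)
ds["air_temperature"].attrs = {"standard_name": "air_temperature", "units": "K"}  # CF
ds.attrs = {                                                                      # ACDD
    "title": "Example near-surface air temperature field",
    "summary": "Synthetic demo data illustrating CF/ACDD global attributes.",
    "creator_name": "Jane Doe",
    "license": "CC-BY-4.0",
    "Conventions": "CF-1.10, ACDD-1.3",
}
ds.to_netcdf("example_cf_acdd.nc")
```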
The HMG NetCDF Initiative is a collaborative effort across German research centers, supported by the Helmholtz DataHub. It contributes to broader Helmholtz efforts (e.g., HMC) to improve research data management, discoverability, and interoperability.
Key milestones include:
- Aligning metadata fields across disciplines,
- Implementing guidelines,
- Developing machine-readable templates and validation tools,
- Supporting user-friendly metadata …
There is an increasing effort in scientific communities to create shared vocabularies and ontologies. These build the foundation of a semantically annotated knowledge graph which can surface all research data and enable holistic data analysis across various data sources and research domains.
Making machine-generated data available in such a knowledge graph is typically done by setting up scripts and data transformation pipelines which automatically add semantic annotations. Unfortunately, a good solution for capturing manually recorded (meta)data in such a knowledge graph is still lacking.
Herbie, the semantic electronic lab notebook and research database developed at Hereon, fills this gap. In Herbie, users can enter all (meta)data on their experiments in customized web forms. Once submitted, Herbie automatically adds semantic annotations and stores everything directly in the knowledge graph. It is thus as easy to use as a spreadsheet but produces FAIR data without any additional post-processing work. Herbie is configured using the standardized Shapes Constraint Language (SHACL) and builds on well-established frameworks in the RDF ecosystem such as RDFS, OWL, and RO-Crate.
We will showcase this approach through a typical example of a production and analysis chain as can be found in many scientific domains.
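To give a flavour of the SHACL-based configuration, the snippet below validates a small RDF record against a shape using pyshacl; the example vocabulary, shape, and data are toys and do not reflect Herbie's actual configuration.

```python
# Sketch: validating a metadata record against a SHACL shape with pyshacl.
# The example vocabulary and shape are toys, not Herbie's actual configuration.
from pyshacl import validate

shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:SampleShape a sh:NodeShape ;
    sh:targetClass ex:Sample ;
    sh:property [ sh:path ex:preparedBy ; sh:minCount 1 ] ;
    sh:property [ sh:path ex:thicknessMicrometre ; sh:datatype xsd:decimal ] .
"""

data_ttl = """
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:sample42 a ex:Sample ;
    ex:preparedBy "A. Researcher" ;
    ex:thicknessMicrometre "12.5"^^xsd:decimal .
"""

conforms, _, report = validate(data_graph=data_ttl, shacl_graph=shapes_ttl,
                               data_graph_format="turtle", shacl_graph_format="turtle")
print("conforms:", conforms)
print(report)
```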
The collection and use of sensor data are vital for scientists monitoring the Earth's environment. It allows for the evaluation of natural phenomena over time and is essential for validating experiments and simulations. Assessing data quality requires understanding the sensor's state, including operation and maintenance, such as calibration parameters and maintenance schedules. In the HMC project MOIN4Herbie, digital recording of FAIR sensor maintenance metadata is developed using the electronic lab notebook Herbie.
In this talk, we will describe the process of configuring Herbie with ontology-based forms for sensor maintenance metadata in our two pilot cases, the Boknis Eck underwater observatory and the Tesperhude research platform. This includes the development of a sensor maintenance ontology and task-specific ontologies tailored for each use case. Ontologies, in information science, are a formalization of concepts, their relations, and properties. They allow for the collection of input that is immediately fit for purpose as findable, machine-readable, and interoperable metadata. By using ontologies, we can ensure the use of controlled vocabularies and organize the knowledge stored within for accessibility and reusability.
A further focus will be the translation of maintenance tasks into Shapes Constraint Language (SHACL) documents that can be rendered as forms to the users …
Digital Twins
- Sept. 4, 2025, 10:45 – 11:45 Oral Session
Technology is revolutionizing our approach to environmental challenges. Among the most promising tools of digitalization is the Digital Twin (DT), or more specifically the Digital Twin of the Ocean (DTO): a virtual replica of the ocean that holds immense potential for sustainable marine development. To successfully confront the increasing impacts and hazards of a changing climate (such as coastal erosion and flooding), it is vital to further develop the DTO to monitor, predict, and protect vulnerable coastal communities. DTOs are powered by AI-enhanced data that integrate ocean conditions, ecosystems, and anthropogenic influences, along with novel AI-driven predictive modeling capabilities combining wave, hydrodynamic, and morphodynamic models. This enables unprecedented accuracy in seamless forecasting. In addition to natural phenomena, DTOs can also include socio-economic factors (e.g. ocean use, pollution). Thus, DTOs can be used to monitor the current ocean state, but also to simulate future ‘What-if’ Scenarios (WiS) for various human interventions. In this way the DTO can guide decisions for coastal protection and the sustainable use of marine resources, while also promoting collaboration on effective solutions for ocean conservation.
In European projects such as the European Digital Twin Ocean (EDITO) ModelLab, work …
Digital twins of the ocean (DTO) make marine data available to support the development of the blue economy and enable direct interaction through bi-directional components. Typical DTOs provide insufficient detail near the coast because their resolution is too coarse and the underlying models lack processes that become relevant in shallow areas, e.g., the wetting and drying of tidal flats. As roughly 2.13 billion people live near a coast, downscaling ocean information to a local scale becomes necessary, as many practical applications, e.g., sediment management, require high-resolution data. For this reason, we focused on the appropriate downscaling of regional and global data from existing DTOs using a high-resolution (100s of meters), unstructured, three-dimensional, process-based hindcast model in combination with in-situ observations. This high-resolution model allows the fine tidal channels, estuaries, and coastal structures like dams and flood barriers to be represented digitally. Our digital twin includes tidal dynamics, salinity, sea water temperature, waves, and suspended sediment transport. Thanks to the fast and intuitive web interface of our prototype digital twin, the model data provided enable a wide range of coastal applications and support sustainable management. Bi-directional web processing services (WPS) were implemented within the interactive web-viewer …
The rapid growth of offshore wind energy requires effective decision-support tools to optimize operations and manage risks. To address this, we developed iSeaPower, a web-based platform designed to support decision-making in offshore renewable energy tasks through real-time data analysis and interactive visualizations. iSeaPower integrates detailed meteorological and oceanographic data with advanced statistical methods, machine learning forecasts, and data assimilation techniques. This integration enables accurate predictions of weather windows, thorough risk assessments, and efficient operational planning for offshore wind energy stakeholders. iSeaPower is designed to optimize journey planning by considering weather conditions and travel duration. The current framework includes five methods tailored to different operational requirements. First, the forecasting method evaluates wind speed and wave height risks over short-term windows (1–3 days) using real-time weather data to quickly identify potential hazards. Second, historical database analysis calculates exceedance probabilities based on 30-day intervals from long-term historical data, revealing recurring weather risk patterns. Third, the delay time estimation method determines potential task delays across the entire year by analyzing monthly weather trends, supporting long-term operational planning and risk management. Fourth, machine learning approaches enhance the accuracy of seven-day forecasts by combining them with historical data, improving short-term predictions. Finally, the updated statistics …
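As a hedged sketch of the exceedance-probability idea behind the second method, the snippet below estimates, from a synthetic historical record, how often wind speed or significant wave height exceed operational limits within a 30-day interval; the thresholds and data are illustrative assumptions, not iSeaPower's actual inputs.

```python
# Sketch: exceedance probability of operational weather limits over a 30-day window.
# Thresholds and the synthetic record are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2015-01-01", "2024-12-31", freq="h")
hist = pd.DataFrame({
    "wind_speed": rng.weibull(2.0, len(idx)) * 8.0,     # m/s
    "wave_height": rng.weibull(1.8, len(idx)) * 1.5,    # m
}, index=idx)

WIND_LIMIT, WAVE_LIMIT = 12.0, 2.0        # assumed operational limits
window = hist.loc["2024-06-01":"2024-06-30"]
exceed = (window["wind_speed"] > WIND_LIMIT) | (window["wave_height"] > WAVE_LIMIT)
print(f"exceedance probability in window: {exceed.mean():.1%}")
```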
Farewell
- Sept. 4, 2025, 11:45 – 12:00 Organizational Items