Oral Session

Data Management Workflows

Sept. 3, 2025, 16:30 – 18:15
Room info: Lecture Hall


The Helmholtz Model Zoo (HMZ) is a cloud-based platform that provides remote access to deep learning models within the Helmholtz Association. It enables seamless inference execution via both a web interface and a REST API, lowering the barrier for scientists to integrate state-of-the-art AI models into their research.

Scientists from all 18 Helmholtz centers can contribute their models to HMZ through a streamlined, well-documented submission process on GitLab. This process minimizes effort for model providers while ensuring flexibility for diverse scientific use cases. Based on the information provided about the model, HMZ automatically generates the web interface and API, tests the model, and deploys it. The REST API further allows for easy integration of HMZ models into other computational pipelines.
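As an illustration, a call to such a REST endpoint might look like the following Python sketch. The URL, payload field names, and token handling are assumptions for illustration only; the actual HMZ API paths and authentication scheme are documented on the platform itself.

```python
import requests

# Hypothetical endpoint and field names: the real HMZ API paths,
# payload layout, and authentication scheme are not given in this abstract.
HMZ_URL = "https://hmz.example.helmholtz.cloud/api/models/my-model/infer"
TOKEN = "..."  # e.g. an access token obtained via the Helmholtz AAI login

with open("sample_input.png", "rb") as f:
    response = requests.post(
        HMZ_URL,
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"input": f},  # upload the file to run inference on
        timeout=300,
    )
response.raise_for_status()
print(response.json())  # inference results returned as JSON
```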

With the launch of HMZ, researchers can now run AI models within the Helmholtz Cloud while keeping their data within the association. The platform imposes no strict limits on the number of inferences or the volume of uploaded data, and it supports both open-access and restricted-access model sharing. Data uploaded for inference is stored within HIFIS dCache InfiniteSpace and remains under the ownership of the uploading user.

HMZ is powered by GPU nodes equipped with four NVIDIA L40 GPUs per …


In 2022, GEOMAR created the Data Science Unit as its internal start-up to centralize Data Science support and activities. With up to eight data scientists supporting GEOMAR, the unit has addressed a variety of projects and services in the years since. Now, three years after its foundation, we present lessons learned, such as the importance of on-site training programs, the challenges of balancing generalization and customization, and the varied success in achieving science-based key performance indicators.


The long-term archiving of marine seismic data acquired from active-source surveys, in compliance with the FAIR data principles, remains a critical yet complex task within the geophysical data life cycle. Data infrastructures such as PANGAEA – Data Publisher for Earth & Environmental Science and affiliated repositories must address the increasing volume, heterogeneity, and complexity of these datasets, which are produced using a variety of acquisition systems. To support this, the German marine seismic community is actively developing metadata standards tailored to different seismic data types, enabling their proper integration and archiving in PANGAEA. In parallel, new semi-automated workflows and standard operating procedures (SOPs) are being established and implemented to ensure consistent data publication and sustainable long-term stewardship.
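To make the idea of type-specific metadata standards concrete, the sketch below shows the kind of fields such a record might carry. The field names are hypothetical; the community standard under development defines its own vocabulary.

```python
# Hypothetical example of a metadata record for one seismic data type;
# the field vocabulary of the actual community standard will differ.
seismic_metadata = {
    "data_type": "multichannel seismic reflection",
    "expedition": "SO-XXX",            # placeholder cruise label
    "acquisition_system": "airgun array with towed streamer",
    "streamer_length_m": 3000,         # illustrative geometry field
    "file_format": "SEG-Y",
    "repository": "PANGAEA",
    "license": "CC-BY-4.0",
}
```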

These advancements are being driven by the “Underway” Research Data project, a cross-institutional initiative of the German Marine Research Alliance (Deutsche Allianz Meeresforschung e.V., DAM). Initiated in mid-2019, the project aims to standardize and streamline the continuous data flow from German research vessels to open-access repositories, in alignment with FAIR data management practices. Marine seismic data curation, in particular, stands out as a successful use case for integrating expedition-based data workflows. By leveraging the tools, infrastructure, and expertise provided by the “Underway” Research Data …


Autonomous Underwater Vehicles (AUVs) and Remotely Operated Vehicles (ROVs) are essential tools for investigating marine environments. These large-scale platforms are equipped with a variety of sensors and systems, including CTDs, fluorometers, multibeam echosounders, side-scan sonar, and camera systems. ROVs can additionally collect water, biological, and geological samples. As a result, the datasets acquired from these missions are highly heterogeneous, combining diverse data types that require careful handling, standardized metadata, and publication.
At GEOMAR, within the context of the Helmholtz DataHub, we develop and implement a comprehensive workflow that spans the entire data lifecycle for large-scale facilities.
It combines the O2A Registry for device management, the Ocean Science Information System (OSIS) for cruise information, PANGAEA for data publication, and the earth-data.de portal for future visualization of AUV and ROV missions.
The presented workflow is currently deployed for GEOMAR’s REMUS6000 AUV "Abyss" and is designed with scalability in mind, enabling its future application to other AUVs and ROVs.
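A minimal sketch of how such a workflow might tie the systems together is shown below; the identifier formats and field names are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass

# Schematic only: identifier formats for the O2A Registry, OSIS, and
# PANGAEA are assumptions; the actual wiring of the workflow differs.
@dataclass
class MissionRecord:
    device_urn: str    # device entry in the O2A Registry
    cruise_id: str     # cruise information held in OSIS
    dataset_doi: str   # published dataset in PANGAEA
    portal_url: str    # visualization target, e.g. earth-data.de

mission = MissionRecord(
    device_urn="urn:auv:remus6000:abyss",   # hypothetical URN
    cruise_id="SO-XXX",                     # placeholder cruise label
    dataset_doi="10.1594/PANGAEA.XXXXXX",   # placeholder DOI
    portal_url="https://earth-data.de",
)
print(mission)
```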


The German research vessels Alkor, Elisabeth Mann Borgese, Heincke, Maria S. Merian, Meteor, Polarstern and Sonne steadily provide oceanographic, meteorological and other data to the scientific community. However, accessing and integrating raw time-series data from these platforms has traditionally been fragmented and technically challenging. The newly deployed DSHIP Land System addresses this issue by consolidating time-series data from marine research vessels into a unified, scalable data warehouse.

At its core, the new system stores raw measurement data in the efficient and open Apache Parquet format. These columnar storage files allow for rapid querying and filtering of large datasets. To ensure flexible and high-performance access, the system uses a Trino SQL query engine running on a Kubernetes cluster composed of three virtual machines. This setup can be elastically scaled to meet variable demand, enabling efficient data access even under high load.
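The effect of the columnar format can be sketched with PyArrow: only the requested columns are read, and row filters are pushed down into the file scan. The column names below are invented for illustration; the real DSHIP channel names will differ.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Invented columns; the real DSHIP sensor channels will differ.
df = pd.DataFrame({
    "time": pd.date_range("2025-09-03 16:30", periods=4, freq="min"),
    "vessel": ["Sonne", "Sonne", "Meteor", "Meteor"],
    "water_temp_c": [18.2, 18.3, 17.9, 18.0],
})
pq.write_table(pa.Table.from_pandas(df), "dship_sample.parquet")

# Columnar storage: load only the needed columns and push the
# row filter down into the scan instead of reading the whole file.
subset = pq.read_table(
    "dship_sample.parquet",
    columns=["time", "water_temp_c"],
    filters=[("vessel", "=", "Sonne")],
)
print(subset.to_pandas())
```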

This talk will briefly introduce the technical foundations of the DSHIP Land System, highlighting the choice of storage format, the architecture of the Trino engine, and its deployment in a containerized Kubernetes environment. The focus will then shift to a demonstration of how users can interactively query the datasets using standard SQL, enabling cross-vessel data exploration, filtering by …
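Such an interactive query could look like the following sketch using the Trino Python client; the host, catalog, schema, and table names are placeholders, as the abstract does not name them.

```python
from trino.dbapi import connect  # pip install trino

# Placeholder connection details and table names: the actual DSHIP
# deployment's host, catalog, and schema are not given in the abstract.
conn = connect(host="trino.example.org", port=443, user="scientist",
               http_scheme="https", catalog="dship", schema="timeseries")
cur = conn.cursor()
cur.execute("""
    SELECT vessel,
           date_trunc('hour', time) AS hour,
           avg(water_temp_c) AS mean_temp_c
    FROM measurements
    WHERE vessel IN ('Polarstern', 'Sonne')            -- cross-vessel
      AND time BETWEEN TIMESTAMP '2025-01-01 00:00:00'
                   AND TIMESTAMP '2025-01-31 23:59:59' -- time filter
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
for vessel, hour, mean_temp in cur.fetchall():
    print(vessel, hour, mean_temp)
```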


The Baltic Sea is a semi-enclosed shelf sea characterized by distinct geographical and oceanographic features. One of its most remarkable features is the surface salinity gradient, which decreases horizontally from the saline North Sea to the nearly fresh Bothnian Sea in the north and the Gulf of Finland in the east. Additionally, a vertical gradient and strong stratification separate less saline surface water from deep saline water. These salinity features are mainly driven by a combination of river runoff, net precipitation, wind conditions, and geographic features that lead to restricted and irregular inflow of saltwater into the Baltic and limited mixing. The overall positive freshwater balance makes the Baltic, with a mean salinity of only about 7 g/kg, much fresher than fully marine ocean waters. The Baltic Sea is particularly sensitive to climate change and global warming due to its shallowness, small volume, and limited exchange with the world oceans. Consequently, it is changing more rapidly than other regions. Recent changes in salinity are less clear due to high variability, but overall surface salinity appears to be decreasing while salinity in the deeper water layers increases. Furthermore, the overall salinity distribution is …


The growing complexity of digital research environments and the explosive increase in data volume demand robust, interoperable infrastructures to support sustainable Research Data Management (RDM). In this context, data spaces have emerged—especially in industry—as a powerful conceptual framework for organizing and sharing data across ecosystems, institutional boundaries, and disciplines. Although the term is not yet fully established in the research community, it maps naturally onto scientific practice, where the integration of heterogeneous datasets and cross-disciplinary collaboration are increasingly central.

Aligned with the principles of open science, FAIR Digital Objects (FDOs) provide a promising infrastructure for structuring these emerging data spaces. FDOs are standardized, autonomous, and machine-actionable digital entities that encapsulate data, metadata, software, and semantic assertions. They enable both humans and machines to Find, Access, Interoperate, and Reuse (FAIR) digital resources efficiently. By abstracting from underlying technologies and embedding persistent, typed relations, FDOs allow for seamless data integration, provenance tracking, and rights management across domains. This structure promotes reproducibility, trust, and long-term sustainability in data sharing.
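Schematically, an FDO record of this kind might be represented as follows; the field names are illustrative rather than a normative FDO profile, and the identifiers are placeholders.

```python
from dataclasses import dataclass, field

# Illustrative only: concrete FDO record layouts vary between
# implementations; this is not a normative FDO profile.
@dataclass
class FairDigitalObject:
    pid: str            # persistent identifier, e.g. a Handle
    type_pid: str       # PID of the registered data type
    metadata_ref: str   # PID or URL of the metadata record
    data_ref: str       # location of the bit sequence itself
    relations: dict = field(default_factory=dict)  # typed links to other FDOs

obj = FairDigitalObject(
    pid="21.T11148/0000-placeholder",       # placeholder Handle
    type_pid="21.T11148/type-placeholder",
    metadata_ref="https://example.org/meta/climate.json",
    data_ref="https://example.org/data/climate.nc",
    relations={"derivedFrom": "21.T11148/1111-placeholder",
               "license": "spdx:CC-BY-4.0"},
)
```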

Using an example from climate research, we demonstrate how data from different data spaces can be combined. By employing STACs (SpatioTemporal Asset Catalogs) defined as FAIR Digital Objects facilitating the European Open …
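For reference, a minimal STAC Item is a small JSON document; the sketch below (as a Python dict) uses placeholder identifiers and asset URLs rather than entries from a real catalog.

```python
# Minimal STAC Item, shown as a Python dict; id, geometry, and asset
# href are placeholders, not entries from a real catalog.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-climate-scene",
    "geometry": {"type": "Point", "coordinates": [8.0, 54.0]},
    "bbox": [8.0, 54.0, 8.0, 54.0],
    "properties": {"datetime": "2025-09-03T00:00:00Z"},
    "assets": {
        "data": {
            "href": "https://example.org/data/scene.nc",
            "type": "application/x-netcdf",
        }
    },
    "links": [],
}
```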
