A novel system to integrate DSHIP data from marine research vessels

Talk
In session Data Management Workflows, Sept. 3, 2025, 16:30 – 18:15
Exact timing: 17:30 – 17:45
Room info: Lecture Hall

Betz, Maximilian¹, Anselm, N.¹, Glöckner, F. O.¹, Immoor, S.¹
  1. Alfred-Wegener-Institut - Helmholtz-Zentrum für Polar- und Meeresforschung

The German research vessels Alkor, Elisabeth Mann Borgese, Heincke, Maria S. Merian, Meteor, Polarstern and Sonne continuously provide oceanographic, meteorological and other data to the scientific community. However, accessing and integrating raw time series data from these platforms has traditionally been fragmented and technically challenging. The newly deployed DSHIP Land System addresses this issue by consolidating time series data from these vessels into a unified and scalable data warehouse.

At its core, the new system stores raw measurement data in the efficient and open Apache Parquet format. These columnar storage files allow for rapid querying and filtering of large datasets. To ensure flexible and high-performance access, the system uses a Trino SQL query engine running on a Kubernetes cluster composed of three virtual machines. This setup can be elastically scaled to meet variable demand, enabling efficient data access even under high load.
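
Why a columnar layout speeds up filtering can be illustrated with a toy model. The sketch below is not the DSHIP Land System's code and does not read real Parquet files. It only mimics two properties of the format in plain Python: values are stored per column, and each row group carries min/max statistics that let a reader skip groups that cannot match a filter. The class `RowGroup`, the function `scan`, and the column names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: one value list per column."""
    columns: Dict[str, List[float]]

    def stats(self, name: str) -> Tuple[float, float]:
        # Real Parquet files store these statistics in the file metadata;
        # here they are simply computed on demand.
        values = self.columns[name]
        return min(values), max(values)


def scan(row_groups, filter_col, lo, hi, select_col):
    """Return select_col values where lo <= filter_col <= hi,
    together with the number of row groups that had to be read."""
    result = []
    groups_read = 0
    for group in row_groups:
        col_min, col_max = group.stats(filter_col)
        # Skip the whole row group if its value range cannot match the filter.
        if col_max < lo or col_min > hi:
            continue
        groups_read += 1
        # Only the two required columns are touched; the others are never read.
        for f, s in zip(group.columns[filter_col], group.columns[select_col]):
            if lo <= f <= hi:
                result.append(s)
    return result, groups_read


# Hypothetical data: time in seconds, water temperature, salinity.
ROW_GROUPS = [
    RowGroup({"time": [0, 10, 20], "temperature": [4.1, 4.2, 4.0], "salinity": [34.9, 35.0, 35.1]}),
    RowGroup({"time": [30, 40, 50], "temperature": [3.9, 3.8, 3.7], "salinity": [35.0, 35.0, 34.8]}),
    RowGroup({"time": [60, 70, 80], "temperature": [3.6, 3.5, 3.4], "salinity": [34.7, 34.9, 35.0]}),
]

values, groups_read = scan(ROW_GROUPS, "time", 35, 55, "temperature")
print(values, groups_read)
```

In this toy example only one of the three row groups is read, and the `salinity` column is never touched. Query engines such as Trino exploit the same idea when reading Parquet files.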

This talk will briefly introduce the technical foundations of the DSHIP Land System, highlighting the choice of storage format, the architecture of the Trino engine, and its deployment in a containerized Kubernetes environment. The focus will then shift to a demonstration of how users can interactively query the datasets using standard SQL, enabling cross-vessel data exploration, filtering by time ranges and geospatial boundaries, and joining with external datasets. Finally, a brief outlook on the current status and future data integration will be given.
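
The kind of query described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual interface: an in-memory SQLite database stands in for the Trino engine so that the example runs without a cluster, and the table `measurements`, its columns, the helper `query_box` and all values are hypothetical. The SQL itself is standard: a time range, a latitude/longitude bounding box, and an aggregation across vessels.

```python
import sqlite3

# SQLite is used here only as a local stand-in for a SQL engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        vessel TEXT, ts TEXT, lat REAL, lon REAL, water_temp REAL
    )
    """
)

# Hypothetical sample rows; timestamps are ISO 8601 strings, which sort correctly as text.
rows = [
    ("Polarstern", "2025-06-01T00:00:00", 78.5, 5.0, 1.2),
    ("Polarstern", "2025-06-02T00:00:00", 79.0, 6.0, 0.8),
    ("Heincke", "2025-06-01T12:00:00", 54.2, 7.9, 14.5),
    ("Meteor", "2025-06-01T06:00:00", 78.9, 4.5, 1.6),
    ("Meteor", "2025-07-15T06:00:00", 78.7, 5.5, 2.4),
]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?, ?)", rows)

# Cross-vessel query: time range plus geospatial bounding box, aggregated per vessel.
QUERY = """
SELECT vessel, COUNT(*) AS n, AVG(water_temp) AS mean_temp
FROM measurements
WHERE ts >= ? AND ts < ?
  AND lat BETWEEN ? AND ?
  AND lon BETWEEN ? AND ?
GROUP BY vessel
ORDER BY vessel
"""


def query_box(conn, start, end, lat_min, lat_max, lon_min, lon_max):
    """Run the bounding-box query with bound parameters and return all rows."""
    params = (start, end, lat_min, lat_max, lon_min, lon_max)
    return conn.execute(QUERY, params).fetchall()


result = query_box(
    conn, "2025-06-01T00:00:00", "2025-07-01T00:00:00", 78.0, 80.0, 0.0, 10.0
)
for vessel, n, mean_temp in result:
    print(vessel, n, mean_temp)
```

With the sample rows above, only the June measurements of Meteor and Polarstern inside the box contribute to the result; the Heincke row lies outside the bounding box and the July Meteor row outside the time range.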

By making time series data easily accessible and queryable, the DSHIP Land System opens new opportunities for data-driven, interdisciplinary environmental research. It enables reproducible, AI-ready workflows and long-term data integration across missions and platforms.