Comparative study to handle missing data in a machine learning model of tidal salinities

Talk
In session Artificial Intelligence/ Machine Learning methods in Earth System Sciences, Sept. 3, 2025, 13:30 – 15:15
Exact timing: 14:15 – 14:30
Room info: Lecture Hall

Lauer, Franziska¹, Gusak, G.², Kösters, F.¹

¹ Bundesanstalt für Wasserbau - Hamburg
² Ocean and Climate Physics, Universität Hamburg

Current AI cannot function without data, yet this precious resource is often underappreciated. In machine learning, dealing with incomplete datasets is a widespread challenge. Large, consistent, and error-free datasets are essential for an optimally trained neural network. Complete and well-structured inputs contribute substantially to training, to the results, and to the conclusions drawn from them. As a result, using high-quality data improves the performance of neural networks and their ability to generalize.
However, real-world datasets from field measurements can contain gaps. Sensor failures, maintenance issues, or inconsistent data collection can cause invalid ('NaN', Not a Number) values to appear in the neural network input matrices.
Imputation techniques are an important step in data processing for handling missing values. Estimating
'NaN' values or replacing them with plausible values directly affects the quality of the input data and thus
the effectiveness of the neural network.

In this contribution, we present a neural network-based regression model (ANN regression) that explains the salt characteristics in the Elbe estuary. In this context, we focus on selecting appropriate imputation strategies.
While traditional methods such as imputation by mean, median, or mode are simple and computationally
efficient, they sometimes fail to preserve the underlying data distribution and relationships.
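
The effect described above can be illustrated with a minimal sketch of column-wise mean and median imputation. The function name `impute_column_stat` and the toy matrix are hypothetical and are not taken from the study; the sketch assumes NumPy and that every column has at least one observed value. Comparing the variance before and after imputation shows one way in which such simple fills can alter the data distribution.

```python
import numpy as np

def impute_column_stat(X, stat="mean"):
    """Replace NaN entries in each column with that column's mean or median.

    Hypothetical helper for illustration; assumes each column has
    at least one observed (non-NaN) value.
    """
    X = np.asarray(X, dtype=float)
    if stat == "mean":
        fill = np.nanmean(X, axis=0)
    else:
        fill = np.nanmedian(X, axis=0)
    # fill has one value per column and broadcasts across the rows
    return np.where(np.isnan(X), fill, X)

# Toy input matrix (made-up numbers, two generic features)
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])

X_mean = impute_column_stat(X, "mean")
X_median = impute_column_stat(X, "median")

# Filling with a constant adds points at the centre of the distribution,
# so the variance of the imputed column is smaller than that of the observed values.
print("variance of observed values:", np.nanvar(X[:, 0]))
print("variance after mean imputation:", np.var(X_mean[:, 0]))
```
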
More sophisticated statistical approaches, such as k-nearest neighbors (k-NN), iterative imputation methods like multiple imputation by chained equations (MICE), and matrix completion algorithms, help to replace 'NaN' values with more accurate estimates. Imputation can also be carried out in advance using integrative machine learning models. These approaches make it possible to take complex, non-linear relationships within the data into account.
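
As a rough illustration of the k-NN idea, the following sketch fills each missing entry with the average of that feature over the k most similar rows, where similarity is measured only on features observed in both rows. It is a simplified, hypothetical implementation (`knn_impute` and the toy data are assumptions, not the method used in the study); ready-made implementations such as scikit-learn's `KNNImputer` and `IterativeImputer` exist and would normally be preferred.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest rows that observe the feature.

    Simplified sketch: distance is the root-mean-square difference over
    features observed in both rows. Entries stay NaN if no donor row exists.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_out = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        # Features observed both in row i and in each candidate row
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        dist[i] = np.inf  # a row is never its own neighbour
        for j in np.where(missing[i])[0]:
            donors = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if donors.size == 0:
                continue
            nearest = donors[np.argsort(dist[donors])[:k]]
            X_out[i, j] = X[nearest, j].mean()
    return X_out

# Toy data (made-up numbers): the last row lacks its third feature
X_demo = np.array([[1.0, 1.0, 10.0],
                   [1.1, 0.9, 12.0],
                   [5.0, 5.0, 50.0],
                   [1.0, 1.0, np.nan]])

X_imputed = knn_impute(X_demo, k=2)
print(X_imputed[3, 2])
```

Unlike a column mean, the fill value here depends on the other features of the same row, which is how such methods can respect relationships between variables.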
Here, we demonstrate which imputation techniques for ANN regression of salt characteristics in the Elbe estuary distort the input matrices the least and improve model accuracy the most.

Apart from the aforementioned data preprocessing steps for the input matrices, we investigate further strategies in which neural architectures are adapted without explicit imputation.
An essential aspect here is the use of masking vectors, which indicate the presence, absence, or substitution of data points. This allows the ANN regression model to weight real and substituted values appropriately during training. In addition, partial input masks, feature dropout, and models that are inherently robust to missing data are possible alternatives.
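
One possible way to realise masking vectors and feature dropout on the input side is sketched below. It is an assumption-laden illustration, not the architecture used in the study: `add_missing_mask` and `feature_dropout` are hypothetical helpers, the placeholder value 0.0 is an arbitrary choice, and the network that would consume the augmented matrix is omitted.

```python
import numpy as np

def add_missing_mask(X, fill_value=0.0):
    """Append a binary mask (1 = observed, 0 = missing) to the inputs.

    NaNs are replaced by a placeholder so the matrix can be fed to a
    network; the mask columns tell the model which values are real.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    X_filled = np.where(observed, X, fill_value)
    return np.concatenate([X_filled, observed.astype(float)], axis=1)

def feature_dropout(X_aug, n_features, rate, rng, fill_value=0.0):
    """Randomly hide features during training to mimic sensor gaps.

    X_aug is the output of add_missing_mask; hidden entries get the
    placeholder value and their mask entry is set to 0.
    """
    X_aug = X_aug.copy()
    drop = rng.random((X_aug.shape[0], n_features)) < rate
    values = X_aug[:, :n_features]   # view on the value columns
    mask = X_aug[:, n_features:]     # view on the mask columns
    values[drop] = fill_value
    mask[drop] = 0.0
    return X_aug

# Toy input matrix (made-up numbers)
X = np.array([[0.5, np.nan, 2.0],
              [1.5, 3.0, np.nan]])

X_aug = add_missing_mask(X)
rng = np.random.default_rng(0)
X_train_batch = feature_dropout(X_aug, n_features=3, rate=0.2, rng=rng)
```

In such a setup the dropout would be applied only during training, with a fresh random pattern per batch, so that the model sees many artificial gap configurations.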
Here, we compare which proves more promising when dealing with incomplete data: a complex data preprocessing workflow or direct adaptation of the neural network.

This contribution provides a comprehensive overview of the methods and considerations that play a role in the imputation of 'NaN' values in input matrices for our machine learning-based solutions. We have learned that the appropriate imputation strategy depends on the specific data types and missing-value patterns. This study emphasizes the importance of selecting an appropriate imputation strategy while taking computational efficiency, data integrity, and model robustness into account. Doing so can improve the robustness and reliability of the ANN analyses when dealing with incomplete data.
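
Because the choice of strategy depends on the missing-value pattern, a first step could be to summarise that pattern per input feature. The sketch below is a hypothetical helper (`missingness_summary`, with made-up data) that reports the fraction of missing entries and the longest consecutive gap in each column, assuming rows are ordered in time. Long gaps, as might follow a sensor failure, and isolated dropouts may call for different treatment.

```python
import numpy as np

def missingness_summary(X):
    """Return (fraction missing, longest consecutive NaN run) per column.

    Assumes the rows of X are consecutive time steps.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    fraction = missing.mean(axis=0)
    longest = []
    for column in missing.T:
        run = best = 0
        for is_missing in column:
            run = run + 1 if is_missing else 0
            best = max(best, run)
        longest.append(best)
    return fraction, np.array(longest)

# Toy time series (made-up numbers, two generic features)
X = np.array([[1.0, np.nan],
              [np.nan, 2.0],
              [np.nan, np.nan],
              [4.0, 4.0],
              [np.nan, 5.0]])

fraction, longest_gap = missingness_summary(X)
print(fraction, longest_gap)
```
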

Data Science Symposium 2025
