Can You Predict the Data? A Workshop on Reproducibility for Language Modelling

Poster
In session Postersession No. 1, Sept. 3, 2025, 15:45 – 16:30

Brandizzi, Nicolo¹, Saleem, Q.¹, Janz, A.², Sandfeld, S.², Leveling, J.¹
  1. Fraunhofer IAIS
  2. Forschungszentrum Jülich - IAS 9

The scientific landscape is continually shifting towards ever-increasing amounts of data, demanding a greater investment of time and resources in the management and (pre-)processing of these data. As a result, data literacy has become a key competency for researchers across all domains, and interdisciplinary, multidisciplinary, and collaborative approaches are more essential than ever before. The Rhine-Ruhr Center for Scientific Data Literacy (DKZ.2R) focuses on a combined methodological data literacy that integrates data science and machine learning skills with high-performance computing and research data management competencies. Our main objective is to promote holistic data literacy by offering support for researchers in the form of training, consulting, data challenges, and tools for data analysis and management.

The availability of ever larger and more complex datasets requires comprehensive methodological skills that researchers must often acquire independently. These skills begin with considering how scientific data should be collected and extend to questions about data processing applications, methods, infrastructure, and, finally, publishing. The DKZ.2R supports researchers in overcoming data-related hurdles in order to find cross-domain solutions and synergies.

In our contribution, we present our workflow for filtering the training data of foundation models, which we apply as a use case in data challenges.
Foundation model performance depends directly on training data quality. Public data sources such as Common Crawl contain significant noise, creating a bottleneck for developing specialized models. We address this by presenting a complete HPC workflow that enables researchers to curate web-scale text data and measure the impact of curation on model performance.

Our workflow equips users to process a 230 GB, 89-million-document German dataset from Common Crawl. Users analyze pre-computed quality signals, design custom filters to remove low-quality content, and execute a pipeline that automates data preparation. This pipeline handles metadata removal, data merging, indexing, and tokenization. It integrates the modalities framework for distributed training on GPUs and uses poetry for environment management, making the HPC operations reproducible.
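
As a rough illustration, the filtering step over pre-computed quality signals might look like the following minimal sketch. The JSONL layout and the signal field names (`quality_signals`, `mean_word_length`, `fraction_non_alphanumeric`, `duplicate_line_fraction`) are assumptions for illustration, not the exact schema used in our pipeline.

```python
# Minimal sketch of a quality-signal filter over one JSONL shard.
# Field names and thresholds are illustrative placeholders, not the
# schema or settings used in the actual DKZ.2R pipeline.
import json
from pathlib import Path
from typing import Iterator


def iter_documents(path: Path) -> Iterator[dict]:
    """Stream documents from a JSONL shard without loading it into memory."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def passes_filter(doc: dict) -> bool:
    """Keep a document only if its pre-computed quality signals clear
    the (illustrative) thresholds."""
    signals = doc.get("quality_signals", {})
    return (
        signals.get("mean_word_length", 0.0) > 3.0
        and signals.get("fraction_non_alphanumeric", 1.0) < 0.25
        and signals.get("duplicate_line_fraction", 1.0) < 0.30
    )


def filter_shard(src: Path, dst: Path) -> tuple[int, int]:
    """Write the filtered subset of one shard; return (kept, total) counts."""
    kept = total = 0
    with dst.open("w", encoding="utf-8") as out:
        for doc in iter_documents(src):
            total += 1
            if passes_filter(doc):
                kept += 1
                out.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return kept, total
```

In the actual workflow, a step of this kind would be parallelized across shards on the HPC system before the automated merging, indexing, and tokenization stages take over.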

We will apply this workflow as an interactive data challenge in which participants are presented with a specific task:
- Start with the 89 million raw documents and their quality signals.
- Design a filter that selects a high-quality subset of approximately 20 million documents, removing around 80% of the original data (see the calibration sketch after this list).
- Execute the automated HPC pipeline to process the filtered subset.
- Train a large language model (LLM) on the curated data and evaluate it to achieve a performance score on the HellaSwag benchmark that exceeds our random-sampling baseline.
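
To hit the roughly 20% retention target, one plausible approach is to calibrate a score threshold on a sample of the corpus. The sketch below assumes a single scalar quality score per document; the actual challenge exposes multiple signals, and the synthetic scores here are stand-ins for sampled corpus statistics.

```python
# Illustrative calibration of a single-signal cutoff so that roughly 20%
# of documents survive. The scores are synthetic stand-ins; this is one
# possible strategy, not the challenge's prescribed method.
import numpy as np


def calibrate_threshold(scores: np.ndarray, keep_fraction: float = 0.2) -> float:
    """Return the score cutoff above which about `keep_fraction` of docs remain."""
    return float(np.quantile(scores, 1.0 - keep_fraction))


rng = np.random.default_rng(0)
sample_scores = rng.normal(loc=0.5, scale=0.15, size=100_000)

cutoff = calibrate_threshold(sample_scores)
kept = (sample_scores > cutoff).mean()
print(f"cutoff={cutoff:.3f}, kept fraction = {kept:.2%}")  # ~20%
```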

This demonstration offers a tangible blueprint for data-centric AI projects. We invite attendees to interact with the system, discuss strategies for large-scale data quality assurance and quality control (QA/QC), and explore how to adapt the framework to other scientific domains. The project directly embodies the symposium's goal of building bridges between an application (LLMs), a method (data filtering), and the infrastructure (HPC) required to connect them. The DKZ.2R thereby aims to make a meaningful and lasting impact by supporting researchers in developing their methodological data literacy.