Tamburini, F. (2025). Curated Data does not mean Representative Data when training Large Language Models: an Experiment using Representative Data for Italian. Aachen: CEUR Workshop Proceedings (CEUR-WS.org).
Curated Data does not mean Representative Data when training Large Language Models: an Experiment using Representative Data for Italian
Tamburini Fabio
2025
Abstract
It is widely accepted in the literature that data curation is the first step for a successful pretraining of Large (and Small) Language Models (LLMs). Datasets generally fall into two categories: open datasets are publicly available, fostering transparency, reproducibility, and community-driven improvement, but they often face limitations in scale, diversity, and quality. Closed datasets, typically curated by private entities, can offer greater scale, higher quality, and proprietary data sources, yet they raise concerns around transparency, bias auditing, and public accountability. This paper presents an experiment aimed at quantitatively measuring the improvements provided by representative datasets for LLM pretraining. We pretrained two small LLMs under the same experimental conditions as the corresponding Italian reference models from the Minerva family, evaluated their performance on standard benchmarks, and used LLM-as-a-Judge to assess the Fluency, Coherence, and Relevance of generated texts on specific tasks. The results support the idea that, while open science and open datasets are important goals, representative corpora, even if closed, are more suitable for LLM pretraining, as they enable better performance under identical experimental conditions.
File: 103_main_long.pdf (conference proceedings contribution, open access, Version of Record, Creative Commons Attribution (CC BY) license, 1.17 MB, Adobe PDF)


