A data lake is a loosely structured collection of data at scale, built for analysis purposes, that is initially fed with almost no data quality requirements. This approach aims to eliminate any effort before the data is actually exploited, but it only delays the problem, since robust and defensible data analysis can be performed only after very complex data preparation activities. In this paper, we address this problem by proposing a novel and general approach to data curation in data lakes based on: (i) the specification of integrity constraints over a conceptual representation of the data lake and (ii) the automatic translation and enforcement of such constraints over the actual data. We discuss the advantages of this idea and the challenges behind its implementation.

Conceptual Constraints for Data Quality in Data Lakes / Paolo Ciaccia, Davide Martinenghi, Riccardo Torlone. - ELECTRONIC. - 3340:(2022), pp. 111-122. (Paper presented at the 1st Italian Conference on Big Data and Data Science (ITADATA 2022), held in Milan, Italy, September 20-21, 2022).

Conceptual Constraints for Data Quality in Data Lakes

Paolo Ciaccia, Davide Martinenghi, Riccardo Torlone
2022

Conference: ITADATA2022: The 1st Italian Conference on Big Data and Data Science
Pages: 111-122
Files in this record:
File: paper34.pdf (open access)
Type: Publisher's version (PDF)
License: Open Access License. Creative Commons Attribution (CC BY)
Size: 1.2 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/918468