Reproducible experiments for generating pre-processing pipelines for AutoETL

Giovanelli, Joseph; Bilalli, Besim; Abelló, Alberto; Silva-Coira, Fernando; de Bernardo, Guillermo

doi:10.1016/j.is.2023.102314

This work is a companion reproducibility paper for the experiments and results reported in Giovanelli et al. (2022), where data pre-processing pipelines are evaluated in order to find pipeline prototypes that reduce the classification error of supervised learning algorithms. With the recent shift towards data-centric approaches, in which the dataset rather than the model is systematically changed to improve model performance, data pre-processing is receiving a lot of attention. Yet its impact on the final analysis is not widely recognized, primarily because of the lack of publicly available experiments that quantify it. To bridge this gap, this work introduces a set of reproducible experiments on the impact of data pre-processing, providing a detailed reproducibility protocol together with a software tool and a set of extensible datasets, which allow all the experiments and results of the aforementioned work to be reproduced. We introduce a set of strongly reproducible experiments based on a collection of intermediate results, and a set of weakly reproducible experiments (Lastra-Diaz, 0000) that allow our end-to-end optimization process and the evaluation of all the methods reported in our primary paper to be reproduced. The reproducibility protocol is created with Docker and tested on Windows and Linux. In brief, our primary work (i) develops a method for generating effective prototypes, i.e., templates or logical sequences of pre-processing transformations, and (ii) instantiates the prototypes into pipelines, i.e., executable or physical sequences of actual operators that implement the respective transformations. For the first, a set of heuristic rules learned from extensive experiments is used; for the second, techniques from Automated Machine Learning (AutoML) are applied.
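The distinction between a prototype (a logical sequence of transformations) and a pipeline (a physical sequence of operators) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the transformation names, the operator names, and the helper `instantiate` are hypothetical and are not taken from the paper or its software tool. The sketch also enumerates all candidate pipelines exhaustively, which stands in for, and is much simpler than, the AutoML-based search the abstract refers to.

```python
from itertools import product

# Hypothetical prototype: an ordered, logical sequence of transformation names.
PROTOTYPE = ("impute", "normalize", "discretize")

# Hypothetical search space: candidate physical operators per transformation.
# None means "skip this transformation".
OPERATORS = {
    "impute": ["mean_imputer", "median_imputer"],
    "normalize": [None, "standard_scaler", "min_max_scaler"],
    "discretize": [None, "k_bins"],
}


def instantiate(prototype, operators):
    """Yield every pipeline (ordered tuple of operators) the prototype admits."""
    choices = [operators[step] for step in prototype]
    for combo in product(*choices):
        # Drop skipped steps while preserving the order fixed by the prototype.
        yield tuple(op for op in combo if op is not None)


pipelines = list(instantiate(PROTOTYPE, OPERATORS))
print(len(pipelines))   # number of candidate pipelines for this prototype
print(pipelines[0])     # first candidate in enumeration order
```

In this toy setting the prototype fixes only the order of the transformations, while the choice of operator for each step is what a search procedure would optimize against the classification error.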

Giovanelli, J., Bilalli, B., Abelló, A., Silva-Coira, F., de Bernardo, G. (2024). Reproducible experiments for generating pre-processing pipelines for AutoETL. INFORMATION SYSTEMS, 120, 1-13 [10.1016/j.is.2023.102314].



Year: 2024
Journal: INFORMATION SYSTEMS
DOI: https://dx.doi.org/10.1016/j.is.2023.102314


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/970717
