Giovanelli, J., Bilalli, B., Abelló, A., Silva-Coira, F., de Bernardo, G. (2024). Reproducible experiments for generating pre-processing pipelines for AutoETL. Information Systems, 120, 1-13. [10.1016/j.is.2023.102314]
Reproducible experiments for generating pre-processing pipelines for AutoETL
Giovanelli, Joseph
2024
Abstract
This work is a companion reproducibility paper of the experiments and results reported in Giovanelli et al. (2022), where data pre-processing pipelines are evaluated in order to find pipeline prototypes that reduce the classification error of supervised learning algorithms. With the recent shift towards data-centric approaches, where the dataset, rather than the model, is systematically changed to improve model performance, data pre-processing is receiving considerable attention. Yet, its impact on the final analysis is not widely recognized, primarily due to the lack of publicly available experiments that quantify it. To bridge this gap, this work introduces a set of reproducible experiments on the impact of data pre-processing by providing a detailed reproducibility protocol together with a software tool and a set of extensible datasets, which allow all the experiments and results of our aforementioned work to be reproduced. We introduce a set of strongly reproducible experiments based on a collection of intermediate results, and a set of weakly reproducible experiments (Lastra-Diaz, 0000) that allow reproducing our end-to-end optimization process and the evaluation of all the methods reported in our primary paper. The reproducibility protocol is packaged in Docker and tested on Windows and Linux. In brief, our primary work (i) develops a method for generating effective prototypes, i.e., templates or logical sequences of pre-processing transformations, and (ii) instantiates the prototypes into pipelines, i.e., executable or physical sequences of actual operators that implement the respective transformations. For the first, a set of heuristic rules learned from extensive experiments is used; for the second, techniques from Automated Machine Learning (AutoML) are applied.
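To make the prototype-versus-pipeline distinction concrete, the following is a minimal illustrative sketch in Python with scikit-learn. The prototype, the candidate operators, and the step names below are hypothetical examples, not the rules or operator sets used in the primary paper; in the actual work, heuristic rules generate the prototype and AutoML search selects and tunes the physical operators.

```python
# Hypothetical sketch: a "prototype" is a logical sequence of transformation
# types; instantiating it means choosing a concrete operator for each type
# and assembling them into an executable pipeline.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier

# A prototype: an ordered template of pre-processing transformation types.
prototype = ["imputation", "normalization", "feature_selection"]

# Candidate physical operators implementing each logical transformation.
# An AutoML search would choose among (and tune) alternatives for each slot.
operators = {
    "imputation": SimpleImputer(strategy="mean"),
    "normalization": StandardScaler(),
    "feature_selection": SelectKBest(k=5),
}

# Instantiate the prototype into an executable pipeline ending in a classifier.
steps = [(name, operators[name]) for name in prototype]
steps.append(("classifier", DecisionTreeClassifier(random_state=0)))
pipeline = Pipeline(steps)
```

The resulting `pipeline` object preserves the prototype's logical ordering while binding each transformation to a concrete, executable operator, which is the instantiation step the abstract describes.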