CRIS Current Research Information System

Data pre-processing plays a key role in a data analytics process (e.g., supervised learning). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, or rules defined, on how pre-processing transformations (e,g., normalization, discretization, etc.) impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., with highest positive impact) for their study at hand. Once found, these pipelines can be optimized using AutoML in order to generate executable pipelines (i.e., with parametrized operators for each transformation). In this work, we study the impact of transformations in general, and the impact of transformations when combined together into pipelines. We develop a generic method that allows to find effective pipeline prototypes. Evaluated using Scikit-learn, our effective pipeline prototypes, when optimized, provide results that get 90% of the optimal predictive accuracy in the median, but with a cost that is 24 times smaller.

Giovanelli J., Bilalli B., Abello A. (2021). Effective data pre-processing for AutoML. CEUR-WS.

Effective data pre-processing for AutoML

Giovanelli J.^{Primo

Software};Bilalli B.;Abello A.

2021

Abstract

Data pre-processing plays a key role in a data analytics process (e.g., supervised learning). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, or rules defined, on how pre-processing transformations (e,g., normalization, discretization, etc.) impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., with highest positive impact) for their study at hand. Once found, these pipelines can be optimized using AutoML in order to generate executable pipelines (i.e., with parametrized operators for each transformation). In this work, we study the impact of transformations in general, and the impact of transformations when combined together into pipelines. We develop a generic method that allows to find effective pipeline prototypes. Evaluated using Scikit-learn, our effective pipeline prototypes, when optimized, provide results that get 90% of the optimal predictive accuracy in the median, but with a cost that is 24 times smaller.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2021
			
	Titolo del volume
	
				DOLAP 2021 - Design, Optimization, Languages and Analytical Processing of Big Data
			
	Pagina iniziale
	
				1
			
	Pagina finale
	
				10
			
	Collana/Serie
	
				CEUR WORKSHOP PROCEEDINGS
			
	Citazione
	
				Giovanelli J.,  Bilalli B.,  Abello A. (2021). Effective data pre-processing for AutoML. CEUR-WS.
			
	Tutti gli autori
	
						Giovanelli J.; Bilalli B.; Abello A.
					
	Appare nelle tipologie:
	
				4.01 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
dolap.pdf accesso aperto Tipo: Versione (PDF) editoriale Licenza: Licenza per Accesso Aperto. Creative Commons Attribuzione (CCBY) Dimensione 2.56 MB Formato Adobe PDF Visualizza/Apri	2.56 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11585/838881

Citazioni

ND

7

ND

social impact