Streaming Approach to Schema Profiling

Forresi, Chiara; Francia, Matteo; Gallinucci, Enrico; Golfarelli, Matteo

doi:10.1007/978-3-031-42941-5_19

Schema profiling consists of producing key insights about the schema of data in a high-variety context. In this paper, we present a streaming approach to schema profiling, where heterogeneous data is continuously ingested from multiple sources, as is typical in many IoT applications (e.g., with multiple devices or applications dynamically logging messages). The produced profile is a clustering of the schemas extracted from the data, and it is computed and evolved in real time under the overlapping sliding window paradigm. The approach is based on two-phase k-means clustering, which entails pre-aggregating the data into a coreset and incrementally updating the previous clustering result rather than recomputing it in every iteration. Unlike previous proposals, the approach works in a domain where dimensionality is variable and unknown a priori, automatically selects the optimal number of clusters, and detects cluster evolution while minimizing the need to recompute the profile. The experimental evaluation demonstrates the effectiveness and efficiency of the approach against the naïve baseline and state-of-the-art stream clustering algorithms.
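The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: every function name is hypothetical, a schema is assumed to be the set of top-level attribute names of a record, the coreset is simplified to counting identical schemas inside the window, and the number of clusters k is fixed by hand, whereas the paper selects it automatically. The sketch only shows how a clustering can be warm-started from the previous window while the attribute vocabulary, and hence the dimensionality, grows.

```python
from collections import Counter, deque

def extract_schema(record):
    """A record's schema, taken here as its set of top-level attribute names."""
    return frozenset(record)

def to_vector(schema, vocabulary):
    """Binary presence vector of a schema over the attributes seen so far."""
    return [1.0 if attr in schema else 0.0 for attr in vocabulary]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_kmeans(points, weights, centroids, iterations=10):
    """Weighted Lloyd iterations, warm-started from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(len(centroids)):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                total = sum(w for _, w in members)
                centroids[c] = [sum(p[d] * w for p, w in members) / total
                                for d in range(len(centroids[c]))]
    return labels, centroids

def profile_stream(records, k, window_size, slide):
    """Return one {schema: cluster label} profile per window slide."""
    window = deque(maxlen=window_size)  # overlapping when slide < window_size
    vocabulary = []                     # grows as new attributes are observed
    centroids = None
    profiles = []
    for i, record in enumerate(records, start=1):
        window.append(record)
        for attr in sorted(record):
            if attr not in vocabulary:
                vocabulary.append(attr)
        if i % slide != 0:
            continue
        # Phase 1 (simplified): collapse identical schemas into weighted points.
        coreset = Counter(extract_schema(r) for r in window)
        schemas = sorted(coreset, key=sorted)
        points = [to_vector(s, vocabulary) for s in schemas]
        weights = [coreset[s] for s in schemas]
        if centroids is None:
            # Naive initialisation: the first k distinct schemas.
            centroids = [list(p) for p in points[:k]]
        else:
            # Zero-pad the previous centroids to the new dimensionality.
            centroids = [c + [0.0] * (len(vocabulary) - len(c)) for c in centroids]
        # Phase 2: update the previous clustering instead of restarting it.
        labels, centroids = weighted_kmeans(points, weights, centroids)
        profiles.append(dict(zip(schemas, labels)))
    return profiles

# Hypothetical stream: two device types whose schemas gain an attribute over time.
stream = ([{"temp": 21.5, "hum": 40}, {"lat": 44.1, "lon": 12.2}] * 2
          + [{"temp": 22.0, "hum": 41, "battery": 87},
             {"lat": 44.2, "lon": 12.3, "speed": 30}] * 2)
profiles = profile_stream(stream, k=2, window_size=6, slide=2)
for profile in profiles:
    print({tuple(sorted(s)): label for s, label in profile.items()})
```

In this toy stream the extended schemas fall into the same clusters as their earlier variants, because the zero-padded centroids from the previous window are reused as the starting point. Detecting when clusters should split, merge, or be recomputed, which the abstract lists among the contributions, is deliberately left out of the sketch.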

Forresi, C., Francia, M., Gallinucci, E., Golfarelli, M. (2023). Streaming Approach to Schema Profiling. Cham : Springer [10.1007/978-3-031-42941-5_19].


Year: 2023
Volume title: New Trends in Database and Information Systems. ADBIS 2023.
First page: 211
Last page: 220
Series: Communications in Computer and Information Science
DOI: https://dx.doi.org/10.1007/978-3-031-42941-5_19
Citation: Forresi, C., Francia, M., Gallinucci, E., Golfarelli, M. (2023). Streaming Approach to Schema Profiling. Cham : Springer [10.1007/978-3-031-42941-5_19].
All authors: Forresi, Chiara; Francia, Matteo; Gallinucci, Enrico; Golfarelli, Matteo
Publication type: 4.01 Contribution in conference proceedings

Files in this record:

File	Size	Format
ADBIS_2023_Streaming_Profiling__short_.pdf (open access from 1 September 2024; type: postprint / Author's Accepted Manuscript (AAM), the version accepted for publication after peer review; license: free open access)	601.31 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/942473

