Schema profiling consists of producing key insights about the schema of data in a high-variety context. In this paper, we present a streaming approach to schema profiling, where heterogeneous data is continuously ingested from multiple sources, as is typical in many IoT applications (e.g., with multiple devices or applications dynamically logging messages). The produced profile is a clustering of the schemas extracted from the data, and it is computed and evolved in real time under the overlapping sliding window paradigm. The approach is based on two-phase k-means clustering, which entails pre-aggregating the data into a coreset and incrementally updating the previous clustering results rather than recomputing them in every iteration. Unlike previous proposals, the approach works in a domain where dimensionality is variable and unknown a priori, automatically selects the optimal number of clusters, and detects cluster evolution while minimizing the need to recompute the profile. The experimental evaluation demonstrated the effectiveness and efficiency of the approach against the naïve baseline and the state-of-the-art algorithms on stream clustering.
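
The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that a schema is the set of attribute paths of a JSON-like record, that the coreset is simply the set of distinct schemas in one window weighted by their frequency, and that the number of clusters `k` is given (the paper selects it automatically and updates clusters incrementally across windows, which is not shown here). All function names (`extract_schema`, `build_coreset`, `weighted_kmeans`, `profile_window`) are hypothetical.

```python
import random
from collections import Counter


def extract_schema(record, prefix=""):
    """Return the set of dot-separated attribute paths of a JSON-like dict."""
    paths = set()
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= extract_schema(value, prefix=path + ".")
        else:
            paths.add(path)
    return frozenset(paths)


def build_coreset(window):
    """Phase 1 (assumed form): collapse a window into distinct schemas with counts."""
    return Counter(extract_schema(record) for record in window)


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeans(points, weights, k, iterations=20, seed=0):
    """Phase 2: plain Lloyd iterations where each point counts `weight` times."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every coreset point.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k), key=lambda c: squared_distance(p, centroids[c])
            )
        # Update step: weighted mean of the members of each cluster.
        for c in range(k):
            members = [i for i in range(len(points)) if assignment[i] == c]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # keep the previous centroid for an empty cluster
            centroids[c] = [
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(len(points[0]))
            ]
    return assignment, centroids


def profile_window(window, k):
    """Cluster the schemas of one window; returns {schema: cluster index}."""
    coreset = build_coreset(window)
    schemas = list(coreset)
    # The dimensionality is not fixed in advance: it is the number of
    # distinct attributes observed in this window.
    vocabulary = sorted(set().union(*schemas))
    points = [[1.0 if a in s else 0.0 for a in vocabulary] for s in schemas]
    weights = [coreset[s] for s in schemas]
    assignment, _ = weighted_kmeans(points, weights, k)
    return dict(zip(schemas, assignment))
```

Clustering the weighted distinct schemas instead of the raw records keeps the cost of the second phase proportional to the number of distinct schemas in the window rather than to the number of messages. Note that `k` must not exceed the number of distinct schemas in the window.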

Forresi, C., Francia, M., Gallinucci, E., Golfarelli, M. (2023). Streaming Approach to Schema Profiling. Cham : Springer [10.1007/978-3-031-42941-5_19].

Streaming Approach to Schema Profiling

Forresi, Chiara; Francia, Matteo; Gallinucci, Enrico; Golfarelli, Matteo
2023

New Trends in Database and Information Systems. ADBIS 2023.
pp. 211-220
Files in this record:
File: ADBIS_2023_Streaming_Profiling__short_.pdf
Open Access from 1 September 2024
Type: Postprint
License: Free open-access license
Size: 601.31 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/942473