Model-based clustering of probability density functions

Montanari, Angela; Calò, Daniela Giovanna

doi:10.1007/s11634-013-0140-8

Complex data, in which each statistical unit under study is described not by a single observation (or vector variable) but by a unit-specific sample of several or even many observations, are becoming more and more common. Reducing these sample data to summary statistics, such as the average or the median, means that most of the inherent information (about variability, skewness or multi-modality) is lost. Full information is preserved only if each unit is described by a whole distribution. This new kind of data, known as “distribution-valued data”, requires the development of adequate statistical methods. This paper presents a method to group a set of probability density functions (pdfs) into homogeneous clusters when the pdfs have to be estimated nonparametrically from the unit-specific data. Since elements belonging to the same cluster are naturally thought of as samples from the same probability model, the idea is to tackle the clustering problem by defining and estimating a proper mixture model on the space of pdfs. Model building is challenging here because of the infinite dimensionality and the non-Euclidean geometry of the domain space. By adopting a wavelet-based representation for the elements of the space, the task is accomplished using mixture models for hyper-spherical data. The proposed solution is illustrated through a simulation experiment and on two real data sets.
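The link between pdfs and hyper-spherical data can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: each pdf is estimated by a histogram, mapped to its square root (so that the resulting vector has unit Euclidean norm), and expanded in an orthonormal Haar wavelet basis, which preserves that norm. Spherical k-means is used here only as a simplified stand-in for a mixture model on the hypersphere. All function names, the bin count and the value range are hypothetical choices made for illustration.

```python
import numpy as np

def sqrt_density_vector(sample, bins=64, value_range=(-6.0, 6.0)):
    """Histogram estimate of a pdf, mapped to a unit-norm vector (assumed transform)."""
    # Observations outside value_range are ignored by np.histogram.
    counts, _ = np.histogram(sample, bins=bins, range=value_range)
    probs = counts / counts.sum()      # bin probabilities, summing to 1
    return np.sqrt(probs)              # squared entries sum to 1

def haar_transform(vec):
    """Orthonormal discrete Haar transform; the length must be a power of two."""
    out = vec.astype(float).copy()
    n = out.size
    while n > 1:
        half = n // 2
        evens, odds = out[0:n:2].copy(), out[1:n:2].copy()
        out[:half] = (evens + odds) / np.sqrt(2.0)    # scaling coefficients
        out[half:n] = (evens - odds) / np.sqrt(2.0)   # detail coefficients
        n = half
    return out

def spherical_kmeans(X, k, n_iter=20):
    """Cluster unit vectors by cosine similarity (stand-in for a spherical mixture)."""
    # Deterministic farthest-point initialisation: start from the first unit,
    # then repeatedly add the unit least similar to the centres chosen so far.
    centers = [X[0]]
    while len(centers) < k:
        sims = X @ np.array(centers).T
        centers.append(X[np.argmin(sims.max(axis=1))])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                total = members.sum(axis=0)
                centers[j] = total / np.linalg.norm(total)  # project back to the sphere
    return labels

# Synthetic illustration: 20 units, each observed through its own sample of 500 values.
rng = np.random.default_rng(0)
samples = [rng.normal(-2.0, 1.0, 500) for _ in range(10)] + \
          [rng.normal(2.0, 1.0, 500) for _ in range(10)]
coeffs = np.array([haar_transform(sqrt_density_vector(s)) for s in samples])
labels = spherical_kmeans(coeffs, k=2)
```

Because the Haar transform is orthonormal, every row of `coeffs` lies on the unit hypersphere, which is the kind of data that directional clustering methods are designed for.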

Montanari A., Calò D.G. (2013). Model-based clustering of probability density functions. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 7(3), 301-319 [10.1007/s11634-013-0140-8].