Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (in: Data analysis and applications 1: clustering and regression, modeling—estimating, forecasting and data mining, ISTE Ltd., London, 2019). 30 of the data sets come with a “true” clustering. On these data sets the similarity of the clusterings from the nine methods to the “true” clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the “true” clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover “true” clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given “true” clustering.
Christian Hennig (2022). An empirical comparison and characterisation of nine popular clustering methods. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 16(1 (March)), 201-229 [10.1007/s11634-021-00478-z].
An empirical comparison and characterisation of nine popular clustering methods
Christian Hennig
2022
Abstract
Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (in: Data analysis and applications 1: clustering and regression, modeling—estimating, forecasting and data mining, ISTE Ltd., London, 2019). 30 of the data sets come with a “true” clustering. On these data sets the similarity of the clusterings from the nine methods to the “true” clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the “true” clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover “true” clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given “true” clustering.File | Dimensione | Formato | |
---|---|---|---|
clusteringmethods-r.pdf
Open Access dal 10/01/2023
Descrizione: Accepted manuscript
Tipo:
Postprint
Licenza:
Licenza per accesso libero gratuito
Dimensione
440.52 kB
Formato
Adobe PDF
|
440.52 kB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.