An empirical comparison and characterisation of nine popular clustering methods

Christian Hennig

2022

Abstract

Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that each measure an individual aspect of the resulting clusters, such as small within-cluster distances, separation of clusters, or closeness to a Gaussian distribution, as introduced in Hennig (in: Data analysis and applications 1: clustering and regression, modeling-estimating, forecasting and data mining, ISTE Ltd., London, 2019). Thirty of the data sets come with a “true” clustering. On these data sets, the similarity of the clusterings from the nine methods to the “true” clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the “true” clusterings, which is unobservable in real clustering problems. The study gives new insight not only into the ability of the methods to discover “true” clusterings, but also into the properties of the clusterings that can be expected from each method, which is crucial for choosing a method in a real situation where no “true” clustering is given.

Year: 2022
Journal: ADVANCES IN DATA ANALYSIS AND CLASSIFICATION
DOI: https://dx.doi.org/10.1007/s11634-021-00478-z
Citation: Christian Hennig (2022). An empirical comparison and characterisation of nine popular clustering methods. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 16(1 (March)), 201-229 [10.1007/s11634-021-00478-z].
All authors: Christian Hennig
Appears in types: 1.01 Journal article

Files in this item:

File	Size	Format
clusteringmethods-r.pdf (Open Access from 10/01/2023; Description: Accepted manuscript; Type: Postprint; Licence: free open-access licence)	440.52 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/898140
