
Hennig C, Lin CJ (2015). Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters. STATISTICS AND COMPUTING, 25(4), 821-833 [10.1007/s11222-015-9566-5].

Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

Hennig C; Lin CJ

2015

Abstract

There are two notoriously hard problems in cluster analysis: estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples involving different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed-type data and temporal and spatial autocorrelation.
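
The testing procedure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the null model is a single one-dimensional Gaussian, the clustering method is a plain 2-means, and the validation index is the ratio of within-cluster to total sum of squares (smaller values suggest stronger clustering). The function names `two_means_1d`, `wss_ratio` and `bootstrap_p_value` are hypothetical.

```python
import random
import statistics


def two_means_1d(xs, iters=50):
    """Plain 2-means (Lloyd's algorithm) on 1-D data; returns the two clusters."""
    c0, c1 = min(xs), max(xs)  # start the centres at the extremes
    for _ in range(iters):
        a = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        b = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if not a or not b:
            break
        n0, n1 = statistics.fmean(a), statistics.fmean(b)
        if (n0, n1) == (c0, c1):  # assignments are stable
            break
        c0, c1 = n0, n1
    return a, b


def sum_of_squares(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = statistics.fmean(values)
    return sum((x - m) ** 2 for x in values)


def wss_ratio(xs):
    """Stand-in validation index: within-cluster / total sum of squares."""
    a, b = two_means_1d(xs)
    total = sum_of_squares(xs)
    return (sum_of_squares(a) + sum_of_squares(b)) / total if total > 0 else 1.0


def bootstrap_p_value(xs, n_boot=200, seed=0):
    """Parametric bootstrap test of homogeneity against clustering.

    Assumed null model: one Gaussian with mean and standard deviation
    estimated from xs. The index is recomputed on each simulated dataset.
    """
    rng = random.Random(seed)
    mu, sigma = statistics.fmean(xs), statistics.stdev(xs)
    observed = wss_ratio(xs)
    null_values = [
        wss_ratio([rng.gauss(mu, sigma) for _ in xs]) for _ in range(n_boot)
    ]
    # Small index values point towards clustering, so count the lower tail.
    extreme = sum(1 for v in null_values if v <= observed)
    return (extreme + 1) / (n_boot + 1)


# Illustrative data: two well-separated groups of 50 points each.
data_rng = random.Random(1)
clustered = [data_rng.gauss(0, 1) for _ in range(50)] + [
    data_rng.gauss(8, 1) for _ in range(50)
]
p_clustered = bootstrap_p_value(clustered)
print(p_clustered)
```

A small p-value means the observed index is more extreme than almost all values simulated under the homogeneous null model. Calibrating the index for estimating the number of clusters would, under the same assumptions, repeat this comparison for each candidate number of clusters.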

Year: 2015
Journal: STATISTICS AND COMPUTING
DOI: https://dx.doi.org/10.1007/s11222-015-9566-5
All authors: Hennig C; Lin CJ
Publication type: 1.01 Journal article

Files in this record:

File	Access	Type	License	Size	Format
Hennig-Lin2015_Article_FlexibleParametricBootstrapFor.pdf	Open access	Publisher's version (PDF) / Version of Record	Creative Commons Attribution (CC BY)	1.17 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/673928
