Minkowski Distances and Standardisation for Clustering and Classification on High-Dimensional Data

Christian Hennig

2020

Abstract

There are many distance-based methods for classification and clustering, and for data with a high number of dimensions and a lower number of observations, processing distances is computationally advantageous compared to the raw data matrix. Euclidean distances are used as a default for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, L1 (city block)-, L2 (Euclidean)-, L3-, L4-, and maximum distances are combined with different schemes of standardisation of the variables before aggregating them. Boxplot transformation is proposed, a new transformation method for a single variable that standardises the majority of observations but brings outliers closer to the main bulk of the data. Distances are compared in simulations for clustering by partitioning around medoids, complete and average linkage, and classification by nearest neighbours, of data with a low number of observations but high dimensionality. The L1-distance and the boxplot transformation show good results.
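As an illustration of the idea summarised in the abstract, the following minimal Python sketch standardises each variable separately and then aggregates the per-variable differences with a Minkowski distance of order q (q = 1, 2, 3, 4, or infinity for the maximum distance). The function names and the three standardisation schemes shown (unit variance, median absolute deviation, range) are assumptions chosen for illustration; the abstract does not list the schemes the chapter compares, and the boxplot transformation is left out because the abstract does not define it.

```python
import math
from statistics import mean, median, pstdev


def standardise(column, scheme="unit_variance"):
    """Centre and scale one variable. The scheme names are illustrative."""
    if scheme == "unit_variance":
        centre, scale = mean(column), pstdev(column)
    elif scheme == "mad":
        centre = median(column)
        scale = median([abs(x - centre) for x in column])
    elif scheme == "range":
        centre, scale = min(column), max(column) - min(column)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    if scale == 0:
        scale = 1.0  # constant variable: avoid division by zero
    return [(x - centre) / scale for x in column]


def standardise_rows(rows, scheme="unit_variance"):
    """Standardise every variable (column) of a list of observations."""
    columns = [standardise(list(col), scheme) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def minkowski(x, y, q):
    """Minkowski distance of order q; q = math.inf gives the maximum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(q):
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1.0 / q)


if __name__ == "__main__":
    # Three observations on two variables with different scales (made-up data).
    rows = [[0.0, 10.0], [3.0, 14.0], [6.0, 18.0]]
    scaled = standardise_rows(rows, scheme="range")
    for q in (1, 2, 3, 4, math.inf):
        print(q, minkowski(scaled[0], scaled[1], q))
```

Because standardisation is applied per variable before aggregation, the choice of scheme and the choice of q can be varied independently, which is the kind of combination the abstract describes.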

Year: 2020
Volume title: Advanced Studies in Behaviormetrics and Data Science
First page: 103
Last page: 118
Series: BEHAVIORMETRICS
DOI: https://dx.doi.org/10.1007/978-981-15-2700-5_6
Citation: Christian Hennig (2020). Minkowski Distances and Standardisation for Clustering and Classification on High-Dimensional Data. Singapore: Imaizumi, Tadashi; Nakayama, Atsuho; Yokoyama, Satoru [10.1007/978-981-15-2700-5_6].
All authors: Christian Hennig
Appears in categories: 2.01 Chapter / essay in book

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/758356
