Exploration of the variability of variable selection based on distances between bootstrap sample results

Hennig, C.; Sauerbrei, W.

doi:10.1007/s11634-018-00351-6

It is well known that variable selection in multiple regression can be unstable and that the model uncertainty can be considerable. The model uncertainty can be quantified and explored by bootstrap resampling, see Sauerbrei et al. (Biom J 57:531–555, 2015). Here approaches are introduced that use the results of bootstrap replications of the variable selection process to obtain more detailed information about the data. Analyses will be based on dissimilarities between the results of the analyses of different bootstrap samples. Dissimilarities are computed between the vector of predictions, and between the sets of selected variables. The dissimilarities are used to map the models by multidimensional scaling, to cluster them, and to construct heatplots. Clusters can point to different interpretations of the data that could arise from different selections of variables supported by different bootstrap samples. A new measure of variable selection instability is also defined. The methodology can be applied to various regression models, estimators, and variable selection methods. It will be illustrated by three real data examples, using linear regression and a Cox proportional hazards model, and model selection by AIC and BIC.
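As a rough illustration of the dissimilarity step described in the abstract, the following minimal sketch compares the results of a few bootstrap replications in two ways: by the sets of selected variables and by the vectors of predictions. The Jaccard dissimilarity for variable sets and the Euclidean distance for predictions are assumptions chosen for simplicity and are not necessarily the definitions used in the paper. The function names and the toy bootstrap results are hypothetical.

```python
import math
from itertools import combinations


def jaccard_dissimilarity(a, b):
    """Return 1 - |a & b| / |a | b|; two empty selections count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def euclidean_dissimilarity(p, q):
    """Euclidean distance between two prediction vectors of equal length."""
    if len(p) != len(q):
        raise ValueError("prediction vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def pairwise_matrix(items, dissimilarity):
    """Build a symmetric matrix of dissimilarities with a zero diagonal."""
    n = len(items)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = dissimilarity(items[i], items[j])
    return d


# Hypothetical results of four bootstrap replications (not data from the paper):
# the variables selected in each replication and the fitted model's predictions
# for the same three observations of the original data set.
selected = [{"x1", "x2"}, {"x1", "x2", "x3"}, {"x1", "x2"}, {"x4"}]
predictions = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],
]

d_sets = pairwise_matrix(selected, jaccard_dissimilarity)
d_pred = pairwise_matrix(predictions, euclidean_dissimilarity)
```

Matrices such as `d_sets` and `d_pred` could then be passed to a multidimensional scaling or hierarchical clustering routine to map and group the bootstrap models.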

Hennig C., Sauerbrei W. (2019). Exploration of the variability of variable selection based on distances between bootstrap sample results. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 13(4), 933-963 [10.1007/s11634-018-00351-6].

Hennig C. (Member of the Collaboration Group)



Year: 2019
Journal: ADVANCES IN DATA ANALYSIS AND CLASSIFICATION
DOI: https://dx.doi.org/10.1007/s11634-018-00351-6
Citation: Hennig C., Sauerbrei W. (2019). Exploration of the variability of variable selection based on distances between bootstrap sample results. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 13(4), 933-963 [10.1007/s11634-018-00351-6].
All authors: Hennig C.; Sauerbrei W.
Appears in types: 1.01 Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/11585/724394
