Large datasets are increasingly common in many research fields. In particular, in the linear regression context, it is often the case that a huge number of potential covariates are available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates. This can be done in a forward selection procedure that includes selecting the variable to enter, deciding to retain it or stop the selection, and estimating the augmented model. Least squares plus t tests can be fast, but the outcome of a forward selection might be suboptimal when there are outliers. In this article we propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes. Because simply replacing the classical statistical criteria with robust ones is not computationally possible, we develop simplified robust estimators, selection criteria, and testing procedures for linear regression. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias can be made smaller by iterating the M-estimator one or more steps further. In the variable selection process, we propose a simplified robust criterion based on a robust t statistic that we compare with a false discovery rate-adjusted level. We carry out a simulation study to show the good performance of our approach. We also analyze two datasets and show that the results obtained by our method outperform those from robust least angle regression and random forests. Supplemental materials are available online.

Fast Robust Model Selection in Large Datasets / Dupuis, DJ; Victoria-Feser, MP. - In: JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION. - ISSN 0162-1459. - Print. - 106:493(2011), pp. 203-212. [10.1198/jasa.2011.tm09650]

Fast Robust Model Selection in Large Datasets

Victoria-Feser, MP
2011

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/952881