Variable selection using random forests

Cited by: 1800
Authors
Genuer, Robin [1]
Poggi, Jean-Michel [1, 2]
Tuleau-Malot, Christine [3]
Affiliations
[1] Univ Paris 11, Math Lab, F-91405 Orsay, France
[2] Univ Paris 05, F-75270 Paris 06, France
[3] Univ Nice Sophia Antipolis, Lab Jean Alexandre Dieudonne, F-06108 Nice 02, France
Keywords
Random forests; Regression; Classification; Variable importance; Variable selection; High dimensional data; LINEAR-REGRESSION; GENE SELECTION; CLASSIFICATION; CANCER
DOI
10.1016/j.patrec.2010.03.014
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Focusing on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, this paper investigates two classical issues of variable selection. The first is to find important variables for interpretation; the second, more restrictive, is to design a good parsimonious prediction model. The main contribution is twofold: to provide some experimental insights into the behavior of the variable importance index based on random forests, and to propose a strategy that ranks explanatory variables by the random forests score of importance and then introduces them through a stepwise ascending procedure. (C) 2010 Elsevier B.V. All rights reserved.
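The two-step idea in the abstract (rank variables by a forest importance score, then introduce them one at a time) can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes scikit-learn, uses its impurity-based `feature_importances_` as a stand-in for whichever importance score the paper uses, and the helper names, the `min_gain` threshold, the `max_vars` cap, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rank_variables(X, y, n_trees=100, seed=0):
    """Return column indices sorted by decreasing forest importance."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # Assumption: impurity-based importance stands in for the paper's score.
    return np.argsort(forest.feature_importances_)[::-1]


def oob_error(X, y, n_trees=100, seed=0):
    """Out-of-bag misclassification rate of a forest fitted on X."""
    forest = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=seed
    )
    forest.fit(X, y)
    return 1.0 - forest.oob_score_


def stepwise_selection(X, y, min_gain=0.005, max_vars=10):
    """Introduce ranked variables one by one, keeping those that help.

    A variable is kept only if it lowers the OOB error by more than
    min_gain (a hypothetical threshold, not taken from the paper).
    """
    selected, best_error = [], 1.0
    for var in rank_variables(X, y)[:max_vars]:
        candidate = selected + [int(var)]
        error = oob_error(X[:, candidate], y)
        if best_error - error > min_gain:
            selected, best_error = candidate, error
    return selected, best_error


# Synthetic data: 20 variables, of which only the first 3 are informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3,
    n_redundant=0, shuffle=False, random_state=0,
)
selected, error = stepwise_selection(X, y)
print("selected variables:", selected)
print("OOB error:", round(error, 3))
```

The out-of-bag error is used here because it comes for free with the forest and avoids a separate validation split; any other error estimate could be substituted.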
Pages: 2225-2236
Page count: 12
Related papers
37 in total
[1] Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling [J].
Alizadeh, AA; Eisen, MB; Davis, RE; Ma, C; Lossos, IS; Rosenwald, A; Boldrick, JG; Sabet, H; Tran, T; Yu, X; Powell, JI; Yang, LM; Marti, GE; Moore, T; Hudson, J; Lu, LS; Lewis, DB; Tibshirani, R; Sherlock, G; Chan, WC; Greiner, TC; Weisenburger, DD; Armitage, JO; Warnke, R; Levy, R; Wilson, W; Grever, MR; Byrd, JC; Botstein, D; Brown, PO; Staudt, LM.
NATURE, 2000, 403 (6769): 503-511
[2] Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays [J].
Alon, U; Barkai, N; Notterman, DA; Gish, K; Ybarra, S; Mack, D; Levine, AJ.
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1999, 96 (12): 6745-6750
[3] [Anonymous], 2008, RR6729 INRIA
[4] [Anonymous], 2001, Computing Science and Statistics
[5] Empirical characterization of random forest variable importance measures [J].
Archer, Kellie J.; Kimes, Ryan V.
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2008, 52 (04): 2249-2260
[6] BENISHAK A, 2008, J SFDS, V149, P43
[7] Biau G, 2008, J MACH LEARN RES, V9, P2015
[8] Random forests [J].
Breiman, L.
MACHINE LEARNING, 2001, 45 (01): 5-32
[9] Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350
[10] BREIMAN L, 2005, RANDOM FORESTS BERKE