The behaviour of random forest permutation-based variable importance measures under predictor correlation

Cited by: 262
Authors
Nicodemus, Kristin K. [1, 2, 3]
Malley, James D. [4]
Strobl, Carolin [5]
Ziegler, Andreas [6]
Affiliations
[1] Univ Oxford, Wellcome Trust Ctr Human Genet, Oxford OX3 7BN, England
[2] Univ Oxford, Dept Clin Pharmacol, Oxford OX3 7DQ, England
[3] NIMH, Genes Cognit & Psychosis Program, Intramural Res Program, NIH, Bethesda, MD USA
[4] NIH, Math & Stat Comp Lab, Div Computat Biosci, Ctr Informat Technol, Bethesda, MD 20892 USA
[5] Univ Munich, Dept Stat, D-80539 Munich, Germany
[6] Med Univ Lubeck, Inst Med Biometrie & Stat, Univ Klinikum Schleswig Holstein, D-23562 Lubeck, Germany
Source
BMC Bioinformatics | 2010 / Vol. 11
Funding
Wellcome Trust (UK)
Keywords
Linear Regression Model; Random Forest; Correlate Variable; Bivariate Model; Correlate Predictor
DOI
10.1186/1471-2105-11-110
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Random forests (RF) are increasingly used in applications such as genome-wide association and microarray studies, where predictor correlation is frequently observed. Recent work on the permutation-based variable importance measures (VIMs) used in RF has come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.
Results: When predictors were both correlated with each other and associated with the outcome (H_A), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H_0) the unconditional RF VIM was unbiased. Conditional VIMs showed lower VIM values for correlated predictors than the unconditional VIMs under H_A and were unbiased under H_0. Scaled VIMs were clearly biased under both H_A and H_0.
Conclusions: Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increase in VIMs for correlated predictors should be considered a "bias", because the VIMs do not directly reflect the coefficients in the generating model, or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
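The unconditional permutation step that the abstract refers to can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's simulation design: the toy data, the function names (`mse`, `permutation_importance`), and the fixed stand-in `predict` function are all hypothetical, and a real RF VIM is computed per tree on out-of-bag samples rather than on one fixed model.

```python
import random


def mse(predict, X, y):
    """Mean squared error of a prediction function on rows X and targets y."""
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling one predictor column.

    This is the unconditional, unscaled variant: the column is permuted
    across all rows, ignoring its correlation with other predictors.
    """
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        increases.append(mse(predict, X_perm, y) - baseline)
    return sum(increases) / n_repeats


# Hypothetical toy data: x1 is correlated with x0, x2 is independent noise,
# and only x0 enters the outcome.
data_rng = random.Random(42)
X, y = [], []
for _ in range(200):
    x0 = data_rng.gauss(0, 1)
    x1 = 0.9 * x0 + (1 - 0.81) ** 0.5 * data_rng.gauss(0, 1)
    x2 = data_rng.gauss(0, 1)
    X.append([x0, x1, x2])
    y.append(x0 + data_rng.gauss(0, 0.1))


def predict(row):
    # Stand-in for a fitted model (an assumption): it uses only x0.
    return row[0]


importances = [permutation_importance(predict, X, y, c) for c in range(3)]
```

Because the stand-in model never reads `x1` or `x2`, permuting them leaves the error unchanged, so their importance is zero. Permutation importance measures what the fitted model uses; in a forest, trees can split on predictors correlated with the causal one, so those predictors can receive nonzero importance.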
Pages: 13