Permutation importance: a corrected feature importance measure

Cited by: 1604
Authors
Altmann, Andre [1 ]
Tolosi, Laura [1 ]
Sander, Oliver [1 ]
Lengauer, Thomas [1 ]
Affiliations
[1] Max Planck Institute for Informatics, Department of Computational Biology and Applied Algorithmics, Saarbrücken, Germany
Keywords
MUTUAL INFORMATION; RANDOM FOREST; PREDICTION; INCOME
DOI
10.1093/bioinformatics/btq134
Chinese Library Classification
Q5 [Biochemistry]
Discipline codes
071010; 081704
Abstract
Motivation: In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in recent years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and Random Forest (RF) models. Recently, it has been observed that RF models are biased in that categorical variables with a large number of categories are preferred.

Results: In this work, we introduce a heuristic for normalizing feature importance measures that corrects the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance then provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures in two real-world case studies. We propose an improved RF model that uses the variables significant with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models.

Availability: R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R

Contact: altmann@mpi-inf.mpg.de, laura.tolosi@mpi-inf.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.
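The permutation scheme described in the abstract can be sketched in a few lines: repeatedly permute the outcome vector, refit the model, collect the "null" importance each feature receives when it cannot be informative, and report an empirical P-value for the observed importance. This is a minimal illustration in Python using scikit-learn, not the authors' R implementation (linked above); the function name `pimp` and its parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pimp(X, y, n_perm=50, random_state=0):
    """Illustrative sketch of the PIMP heuristic.

    Fits an RF on (X, y) to get observed importances, then refits on
    n_perm permuted copies of y to build a per-feature null distribution.
    Returns the observed importances and empirical P-values.
    """
    rng = np.random.default_rng(random_state)
    rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    observed = rf.fit(X, y).feature_importances_

    null_imps = np.empty((n_perm, X.shape[1]))
    for i in range(n_perm):
        # Permuting the outcome breaks any feature-outcome association,
        # so these importances reflect only the bias of the measure.
        null_imps[i] = rf.fit(X, rng.permutation(y)).feature_importances_

    # Empirical P-value with add-one smoothing: fraction of null
    # importances at least as large as the observed importance.
    pvals = (1 + (null_imps >= observed).sum(axis=0)) / (1 + n_perm)
    return observed, pvals
```

Informative features should receive small P-values regardless of how many categories (and hence how much raw Gini importance) a non-informative competitor accumulates; that is the bias correction the paper targets.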
Pages: 1340-1347
Number of pages: 8