Mining data with random forests: A survey and results of new tests

Cited by: 539
Authors
Verikas, A. [1, 2]
Gelzinis, A. [2]
Bacauskiene, M. [2]
Affiliations
[1] Halmstad Univ, Intelligent Syst Lab, S-30118 Halmstad, Sweden
[2] Kaunas Univ Technol, Dept Elect & Control Equipment, LT-51368 Kaunas, Lithuania
Keywords
Random forests; Variable importance; Variable selection; Classifier; Data proximity; SUPPORT VECTOR MACHINE; FEATURE-SELECTION; COMPOUND CLASSIFICATION; CANCER CLASSIFICATION; CHURN PREDICTION; CLASSIFIERS; TREES; MODELS; TOOL; IDENTIFICATION;
DOI
10.1016/j.patcog.2010.08.011
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Random forests (RF) has become a popular technique for classification, prediction, studying variable importance, variable selection, and outlier detection. There are numerous application examples of RF in a variety of fields. Several large-scale comparisons including RF have been performed. There are numerous articles in which variable importance evaluations based on the variable importance measures available from RF are used for data exploration and understanding. Apart from the literature survey in the RF area, this paper also presents results of new tests regarding variable rankings based on RF variable importance measures. We studied experimentally the consistency and generality of such rankings. Results of the studies indicate that there is no evidence supporting the belief in the generality of such rankings. A high variance of variable importance evaluations was observed in the case of a small number of trees and small data sets. (C) 2010 Elsevier Ltd. All rights reserved.
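The abstract does not say how the consistency of variable rankings was measured. As an illustration only, one common way to compare two rankings is the Spearman rank correlation between the importance scores from two independently trained forests. The sketch below assumes that choice; the function names and the importance scores are hypothetical and are not taken from the paper.

```python
def ranks(scores):
    """Return 1-based ranks (1 = highest score); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all ranks are tied")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical importance scores for four variables from two RF runs
# (made-up numbers for illustration, not results from the paper).
run_1 = [0.40, 0.30, 0.20, 0.10]
run_2 = [0.35, 0.15, 0.30, 0.20]
agreement = spearman(run_1, run_2)
```

A value near 1 means the two runs order the variables almost identically, while values near 0 or below indicate unstable rankings. Repeating the comparison over many pairs of runs, for different numbers of trees and data set sizes, would be one way to probe the kind of variance the abstract reports.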
Pages: 330-349
Page count: 20