Optimizing the number of trees in a decision forest to discover a subforest with high ensemble accuracy using a genetic algorithm

Cited by: 58
Authors
Adnan, Md Nasim [1]
Islam, Md Zahidul [1]
Affiliations
[1] Charles Sturt Univ, Sch Comp & Math, Bathurst, NSW 2795, Australia
Keywords
Classification; Decision tree; Decision forest; Random forest; Ensemble accuracy; Diversity
DOI
10.1016/j.knosys.2016.07.016
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A decision forest is an ensemble of decision trees, often built to discover more patterns (i.e., logic rules) and to predict/classify class values more accurately than a single decision tree. Existing decision forest algorithms typically build a huge number of decision trees, incurring large memory and computational overhead, in order to achieve high accuracy. Generally, many of these trees do not contribute to improving the ensemble accuracy of the forest. Ensemble pruning algorithms therefore aim to discard those trees, generating a subforest that achieves higher (or comparable) ensemble accuracy than the original forest. The objectives are twofold: select as few trees as possible, and keep the ensemble accuracy of the subforest as high as possible. An optimal subforest can be found by exhaustive search; however, this is not practical for any standard-sized forest, as the number of candidate subforests grows exponentially with forest size. To avoid the computational burden of an exhaustive search, many greedy and genetic algorithm-based subforest selection techniques have been proposed in the literature. In this paper, we propose a subforest selection technique that achieves both small size and high accuracy. We use a genetic algorithm in which we carefully select high-quality individual trees for the initial population in order to improve the final output of the algorithm. Experiments are conducted on 20 data sets from the UCI Machine Learning Repository to compare the proposed technique with several existing state-of-the-art techniques. The results indicate that the proposed technique can select effective subforests that are significantly smaller than the original forests while achieving better (or comparable) accuracy. (C) 2016 Elsevier B.V. All rights reserved.
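The core idea described in the abstract — evolve tree subsets with a genetic algorithm whose initial population is seeded with individually accurate trees — can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: it assumes a scikit-learn RandomForestClassifier, a held-out validation set as the fitness set, binary 0/1 class labels, and illustrative GA parameters (population size, generation count, mutation rate).

```python
# Minimal sketch of GA-based subforest selection (assumptions noted above).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
trees = forest.estimators_
n_trees = len(trees)

# Precompute each tree's validation predictions once.
all_preds = np.stack([t.predict(X_val) for t in trees])   # (n_trees, n_val)

def ensemble_accuracy(mask):
    # Majority-vote accuracy of the subforest selected by a boolean mask
    # (assumes binary 0/1 labels).
    if not mask.any():
        return 0.0
    majority = (all_preds[mask].mean(axis=0) >= 0.5).astype(int)
    return float((majority == y_val).mean())

# Seed the initial population with a bias toward individually accurate
# trees -- the "high quality individuals" idea from the abstract.
indiv_acc = np.array([t.score(X_val, y_val) for t in trees])
p_select = indiv_acc / indiv_acc.sum()

pop_size, n_gen = 30, 40
population = np.zeros((pop_size, n_trees), dtype=bool)
for chrom in population:
    chosen = rng.choice(n_trees, size=rng.integers(5, 31),
                        replace=False, p=p_select)
    chrom[chosen] = True

for _ in range(n_gen):
    fitness = np.array([ensemble_accuracy(c) for c in population])
    order = np.argsort(fitness)[::-1]
    survivors = population[order[: pop_size // 2]]          # truncation selection
    children = []
    while len(survivors) + len(children) < pop_size:
        a, b = survivors[rng.integers(len(survivors), size=2)]
        cut = rng.integers(1, n_trees)                      # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(n_trees) < 0.01                 # bit-flip mutation
        children.append(child)
    population = np.vstack([survivors] + children)

fitness = np.array([ensemble_accuracy(c) for c in population])
best = population[fitness.argmax()]
print(f"subforest size: {int(best.sum())}, val accuracy: {fitness.max():.4f}, "
      f"full forest: {forest.score(X_val, y_val):.4f}")
```

Fitness here is plain validation accuracy; one natural variation would be to add a size penalty to the fitness to push the search harder toward small subforests.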
Pages: 86-97
Page count: 12