A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches

Cited by: 2004
Authors
Galar, Mikel [1]
Fernandez, Alberto [2]
Barrenechea, Edurne [1]
Bustince, Humberto [1]
Herrera, Francisco [3]
Affiliations
[1] Univ Publ Navarra, Dept Automat & Computac, Navarra 31006, Spain
[2] Univ Jaen, Dept Comp Sci, Jaen 23071, Spain
[3] Univ Granada, Dept Comp Sci & Artificial Intelligence, E-18071 Granada, Spain
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS | 2012 / Vol. 42 / No. 04
Keywords
Bagging; boosting; class distribution; classification; ensembles; imbalanced data-sets; multiple classifier systems; SUPPORT VECTOR MACHINES; STATISTICAL COMPARISONS; NEURAL-NETWORKS; DECISION TREES; CLASSIFICATION; CLASSIFIERS; PERFORMANCE; STRATEGIES; VARIANCE; ACCURACY;
DOI
10.1109/TSMCC.2011.2161285
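Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract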
Learning classifiers from data-sets with imbalanced class distributions is a challenging problem in the data mining community. The issue arises when one class is represented by far fewer examples than the others. Its presence in many real-world applications has drawn growing attention from researchers. In machine learning, ensembles of classifiers are known to improve on the accuracy of single classifiers by combining several of them, but none of these learning techniques alone solves the class imbalance problem; to deal with this issue, ensemble learning algorithms have to be designed specifically. In this paper, our aim is to review the state of the art on ensemble techniques in the framework of imbalanced data-sets, with a focus on two-class problems. We propose a taxonomy for ensemble-based methods that address class imbalance, in which each proposal is categorized according to the inner ensemble methodology on which it is based. In addition, we carry out a thorough empirical comparison of the most significant published approaches within the families of the proposed taxonomy, to show whether any of them makes a difference. This comparison has shown the good behavior of the simplest approaches, which combine random undersampling techniques with bagging or boosting ensembles. In addition, the positive synergy between sampling techniques and bagging has stood out. Furthermore, our results show empirically that ensemble-based algorithms are worthwhile, since they outperform the mere use of preprocessing techniques before learning the classifier; the significant improvement in results therefore justifies the increase in complexity.
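The combination of random undersampling with bagging mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or its experimental setup: it is one simple variant, written under the assumptions that the minority class is labeled 1, the majority class is labeled 0, and the majority class has at least as many examples as the minority class. The helper names (`fit_stump`, `fit_under_bagging`, `predict`) and the toy data are hypothetical, and a one-feature decision stump stands in for whatever base classifier would be used in practice.

```python
import random


def fit_stump(X, y):
    """Fit a one-feature threshold rule (decision stump) by exhaustive search."""
    best = None  # (training errors, feature index, threshold, direction)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                errors = sum(
                    (1 if sign * (row[j] - t) >= 0 else 0) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) >= 0 else 0


def fit_under_bagging(X, y, n_estimators=11, seed=0):
    """Train one stump per balanced bag.

    Assumption: label 1 is the minority class, label 0 the majority class.
    Each bag keeps every minority example and adds an equally sized random
    draw (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    models = []
    for _ in range(n_estimators):
        bag = minority + rng.sample(majority, len(minority))
        models.append(fit_stump([X[i] for i in bag], [y[i] for i in bag]))
    return models


def predict(models, row):
    """Combine the stumps by majority vote."""
    votes = sum(model(row) for model in models)
    return 1 if 2 * votes > len(models) else 0


# Toy imbalanced data (hypothetical): 40 majority examples, 4 minority examples.
X = [[float(v)] for v in range(40)] + [[50.0], [52.0], [55.0], [60.0]]
y = [0] * 40 + [1] * 4
models = fit_under_bagging(X, y)
```

Because every bag is balanced, each base classifier sees the minority class as often as the majority class, while the ensemble as a whole still draws on many different majority examples across bags. An odd `n_estimators` avoids tied votes.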
Pages: 463-484
Page count: 22