Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy

Cited by: 289
Authors
Garcia, Salvador [1 ]
Herrera, Francisco [1 ]
Affiliations
[1] Univ Granada, Dept Comp Sci & Artificial Intelligence, E-18071 Granada, Spain
Keywords
Classification; class imbalance problem; undersampling; prototype selection; evolutionary algorithms; feature selection; algorithms; reduction; systems; models; rules; sets
DOI
10.1162/evco.2009.17.3.275
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed to address this problem, such as modifying the learning methods or applying a preprocessing stage. Within preprocessing focused on balancing the data, two tendencies exist: reducing the set of examples (undersampling) or replicating minority class examples (oversampling). Undersampling with imbalanced datasets can be considered a prototype selection procedure whose purpose is to balance the datasets, achieving a high classification rate while avoiding the bias toward majority class examples. Evolutionary algorithms have been used for classical prototype selection with good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods called evolutionary undersampling that take into consideration the nature of the problem and use different fitness functions to obtain a good trade-off between balance of the class distribution and performance. The study includes a taxonomy of the approaches and an overall comparison among our models and state-of-the-art undersampling methods. The results have been contrasted using nonparametric statistical procedures and show that evolutionary undersampling outperforms the nonevolutionary models as the degree of imbalance increases.
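To make the idea summarized in the abstract concrete, below is a minimal sketch of evolutionary undersampling for a binary problem. It is not the authors' exact algorithm: it uses a plain generational genetic algorithm (not necessarily the evolutionary model or the specific fitness functions studied in the paper), evolving a binary mask over majority-class examples while keeping the minority class whole. The fitness follows the trade-off described in the abstract, combining classifier performance (leave-one-out 1-NN geometric mean of per-class accuracies) with a penalty for residual class imbalance. All names and parameters here (evolutionary_undersample, pop_size, penalty, and so on) are illustrative assumptions, not definitions from the paper.

```python
# Illustrative sketch of evolutionary undersampling (binary classification).
# NOT the authors' exact method; a simple generational GA over a binary mask
# that selects which majority-class examples to keep.
import numpy as np

rng = np.random.default_rng(0)

def loo_1nn_gmean(X, y):
    """Leave-one-out 1-NN geometric mean of per-class accuracies."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point cannot be its own neighbor
    pred = y[np.argmin(d, axis=1)]
    accs = [np.mean(pred[y == c] == c) for c in np.unique(y)]
    return float(np.sqrt(np.prod(accs)))

def fitness(mask, X_maj, X_min, penalty=0.2):
    """g-mean on the selected subset minus a penalty for leftover imbalance."""
    kept = int(mask.sum())
    if kept == 0:                                # degenerate chromosome: nothing kept
        return 0.0
    X = np.vstack([X_maj[mask.astype(bool)], X_min])
    y = np.array([0] * kept + [1] * len(X_min))
    imbalance = abs(kept - len(X_min)) / len(X_min)
    return loo_1nn_gmean(X, y) - penalty * imbalance

def evolutionary_undersample(X_maj, X_min, pop_size=30, generations=50, p_mut=0.02):
    """Evolve a binary mask over majority-class examples; return kept indices."""
    n_maj = len(X_maj)
    pop = rng.integers(0, 2, size=(pop_size, n_maj))
    for _ in range(generations):
        fit = np.array([fitness(ind, X_maj, X_min) for ind in pop])
        parents = pop[np.argsort(fit)[::-1][: pop_size // 2]]   # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_maj)                         # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_maj) < p_mut                     # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents] + children)
    fit = np.array([fitness(ind, X_maj, X_min) for ind in pop])
    return np.flatnonzero(pop[int(np.argmax(fit))])
```

Under these assumptions, evolutionary_undersample(X_maj, X_min) returns the indices of the majority-class examples to keep; the balanced training set is then the union of those selected examples with the full minority class.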
Pages: 275-306
Number of pages: 32