A dynamic over-sampling procedure based on sensitivity for multi-class problems

Cited by: 139
Authors
Fernandez-Navarro, Francisco [1 ]
Hervas-Martinez, Cesar [1 ]
Antonio Gutierrez, Pedro [1 ]
Affiliations
[1] Univ Cordoba, Dept Comp Sci & Numer Anal, E-14071 Cordoba, Spain
Keywords
Classification; Multi-class; Sensitivity; Accuracy; Memetic algorithm; Imbalanced datasets; Over-sampling method; SMOTE; FUNCTION NEURAL-NETWORK; IMBALANCED DATA; ROC CURVE; CLASSIFICATION; SYSTEMS; AREA;
DOI
10.1016/j.patcog.2011.02.019
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
140502 [Artificial Intelligence];
Abstract
Classification with imbalanced datasets poses a new challenge for researchers in machine learning. The problem arises when the number of patterns representing one class of the dataset (usually the concept of interest) is much lower than for the remaining classes, so the learning model must be adapted to this situation, which is very common in real applications. In this paper, a dynamic over-sampling procedure is proposed for improving the classification of imbalanced datasets with more than two classes. The procedure is incorporated into a memetic algorithm (MA) that optimizes radial basis function neural networks (RBFNNs). To handle class imbalance, the training data are resampled in two stages. In the first stage, an over-sampling procedure is applied to the minority class to partially balance the class sizes. Then the MA is run and the data are over-sampled in different generations of the evolution, generating new patterns of the minimum-sensitivity class (the class with the worst accuracy for the best RBFNN in the population). The proposed methodology is tested on 13 imbalanced benchmark classification datasets from well-known machine learning problems and on one complex problem of microbial growth, and it is compared to other neural network methods specifically designed for handling imbalanced data. These methods include different over-sampling procedures in the preprocessing stage, a threshold-moving method in which the output threshold is moved toward the inexpensive classes, and ensemble approaches combining the models obtained with these techniques. The results show that the proposal improves sensitivity on the generalization set and obtains both a high accuracy level and a good classification level for each class. (C) 2011 Elsevier Ltd. All rights reserved.
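The abstract's first resampling stage relies on SMOTE-style over-sampling, which is listed among the keywords. The paper itself does not include code; the sketch below is a minimal, generic illustration of the SMOTE interpolation idea (new minority patterns are created between a sample and one of its k nearest minority-class neighbours), not the authors' dynamic, sensitivity-driven procedure. All function and variable names are this sketch's own assumptions.

```python
import random

def smote_oversample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class points by interpolating
    between a random minority sample and one of its k nearest minority
    neighbours (squared Euclidean distance), as in basic SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority samples, the generated patterns stay inside the convex hull of the minority class; the paper's dynamic variant additionally re-targets generation at the minimum-sensitivity class during the MA's evolution.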
Pages: 1821-1833
Page count: 13