Knowledge discovery from imbalanced and noisy data

Cited by: 168
Authors
Van Hulse, Jason [1 ]
Khoshgoftaar, Taghi [1 ]
Affiliations
[1] Florida Atlantic Univ, Dept Comp Sci & Engn, Empir Software Engn Lab, Boca Raton, FL 33431 USA
Keywords
Data sampling; Class noise; Labeling errors; Class imbalance; Skewed class distribution; ATTRIBUTE NOISE; CLASSIFICATION;
DOI
10.1016/j.datak.2009.08.005
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Class imbalance and labeling errors present significant challenges to data mining and knowledge discovery applications. Some previous work has discussed these important topics; however, the relationship between the two issues has not received enough attention. Further, much of the previous work in this domain is fragmented and contradictory, leading to serious questions regarding the reliability and validity of the empirical conclusions. In response, we present a comprehensive suite of experiments carefully designed to provide conclusive, reliable, and significant results on the problem of learning from noisy and imbalanced data. Noise is shown to significantly impact all of the learners considered in this work, and a particularly important factor is the class in which the noise is located (which, as discussed throughout this work, has very important implications for noise handling). The impact of noise, however, varies dramatically depending on the learning algorithm: simple algorithms such as naive Bayes and nearest neighbor learners are often more robust than more complex learners such as support vector machines or random forests. Sampling techniques, which are often used to alleviate the adverse effects of imbalanced data, are shown to improve the performance of learners built from noisy and imbalanced data; in particular, simple sampling techniques such as random undersampling are generally the most effective. (C) 2009 Elsevier B.V. All rights reserved.
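The two ideas at the center of the abstract — random undersampling of the majority class, and class noise concentrated in one class — can be sketched in a few lines. This is a minimal illustration, not the authors' experimental code: the function names, the equal-class-size target for undersampling, and the label-flipping model of class noise are all illustrative assumptions.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly discard majority-class instances until both classes
    are the same size (binary labels assumed)."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_label, minority_count = min(counts.items(), key=lambda kv: kv[1])
    by_class = {label: [i for i, lab in enumerate(y) if lab == label]
                for label in counts}
    kept = []
    for label, idxs in by_class.items():
        if label == minority_label:
            kept.extend(idxs)                      # keep all minority instances
        else:
            kept.extend(rng.sample(idxs, minority_count))  # subsample the rest
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

def inject_class_noise(y, target_label, other_label, rate, seed=0):
    """Flip a `rate` fraction of the instances carrying `target_label`
    to `other_label` -- mimicking labeling errors concentrated in one
    class, the factor the abstract highlights."""
    rng = random.Random(seed)
    idxs = [i for i, lab in enumerate(y) if lab == target_label]
    flipped = set(rng.sample(idxs, int(round(rate * len(idxs)))))
    return [other_label if i in flipped else lab for i, lab in enumerate(y)]
```

For example, starting from a skewed label vector `[0]*8 + [1]*2`, `random_undersample` returns a balanced sample of four instances, while `inject_class_noise(y, 1, 0, 0.5)` relabels half of the minority class as majority.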
Pages: 1513-1542
Number of pages: 30