Automatically countering imbalance and its empirical relationship to cost

Cited by: 167
Authors
Chawla, Nitesh V. [1 ]
Cieslak, David A. [1 ]
Hall, Lawrence O. [2 ]
Joshi, Ajay [2 ]
Affiliations
[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
classification; unbalanced data; cost-sensitive learning
DOI
10.1007/s10618-008-0087-0
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Learning from imbalanced data sets presents a convoluted problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common but important criticism: how does one automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set by optimizing evaluation functions such as the f-measure, Area Under the ROC Curve (AUROC), cost, cost curves, and the cost-dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the wrapper approach against two cost-sensitive learning methods, MetaCost and the Cost-Sensitive Classifiers, and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, on the KDD-99 Cup intrusion detection data set, we obtained a lower cost per test example than any result we are aware of.
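The wrapper paradigm summarized above is, at its core, a search loop over candidate re-sampling amounts, each scored by a chosen evaluation function under cross-validation. The Python sketch below illustrates that loop under stated assumptions: it uses imbalanced-learn's SMOTE as the re-sampler, a decision tree as the base learner, the f-measure as the optimization target, and an arbitrary ratio grid. None of these choices are taken from the paper itself; this is a minimal sketch of the idea, not the authors' procedure.

```python
# Minimal sketch of the wrapper idea: search over re-sampling amounts and
# keep the one that maximizes a cross-validated evaluation function.
# Assumptions (not from the paper): SMOTE re-sampler, decision-tree learner,
# f-measure objective, and the ratio grid below.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_select_ratio(X, y, ratios=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return the SMOTE minority/majority ratio with the best CV f-measure."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best_ratio, best_f1 = None, -np.inf
    for ratio in ratios:
        # Re-sampling happens inside the pipeline, so synthetic examples are
        # generated from training folds only and never leak into validation.
        pipe = Pipeline([
            ("smote", SMOTE(sampling_strategy=ratio, random_state=0)),
            ("tree", DecisionTreeClassifier(random_state=0)),
        ])
        try:
            f1 = cross_val_score(pipe, X, y, cv=cv, scoring="f1",
                                 error_score="raise").mean()
        except ValueError:
            continue  # ratio is below the data set's existing class ratio
        if f1 > best_f1:
            best_ratio, best_f1 = ratio, f1
    return best_ratio, best_f1
```

Swapping scoring="f1" for a cost-based scorer (for example, one built with sklearn.metrics.make_scorer) would correspond to the cost-optimizing variants the abstract compares.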
Pages: 225-252
Page count: 28