Applying support vector machines to imbalanced datasets

Cited by: 686
Authors
Akbani, R
Kwek, S
Japkowicz, N
Affiliations
[1] Univ Texas, Dept Comp Sci, San Antonio, TX 78249 USA
[2] Univ Ottawa, Sch Informat Technol & Engn, Ottawa, ON K1N 6N5, Canada
Source
MACHINE LEARNING: ECML 2004, PROCEEDINGS | 2004 / Vol. 3201
DOI
10.1007/978-3-540-30115-8_7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Support Vector Machines (SVMs) have been extensively studied and have shown remarkable success in many applications. However, the success of SVMs is very limited when they are applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber positive instances (e.g., in gene profiling and credit card fraud detection). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVMs. We then propose an algorithm for overcoming these problems, based on a variant of the SMOTE algorithm by Chawla et al. combined with the different error costs algorithm of Veropoulos et al. We compare the performance of our algorithm against these two algorithms, as well as undersampling and regular SVM, and show that our algorithm outperforms all of them.
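The two ingredients named in the abstract can be illustrated with a minimal sketch. This is plain SMOTE-style interpolation between minority points plus a class-ratio cost heuristic; it is not the authors' SMOTE variant, and the function names (`smote_like`, `cost_ratio`), the toy data, and the choice of setting the cost ratio to the class ratio are all assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (plain SMOTE idea, not the paper's variant).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def cost_ratio(n_pos, n_neg):
    """Assumed heuristic: weight errors on positives n_neg / n_pos times more."""
    return n_neg / n_pos


# Toy imbalanced data: 4 positives, 40 negatives (hypothetical values).
positives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
negatives = [(float(i % 10), float(i // 10) + 3.0) for i in range(40)]

new_points = smote_like(positives, n_new=len(positives))
oversampled = positives + new_points
ratio = cost_ratio(len(oversampled), len(negatives))
```

In a full pipeline, the oversampled positives and the original negatives would be passed to an SVM trainer that accepts separate misclassification costs per class, with `ratio` setting how much more heavily errors on positives are penalised.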
Pages: 39-50
Page count: 12
Related papers
15 in total
  • [1] TOLERATING NOISY, IRRELEVANT AND NOVEL ATTRIBUTES IN INSTANCE-BASED LEARNING ALGORITHMS
    AHA, DW
    [J]. INTERNATIONAL JOURNAL OF MAN-MACHINE STUDIES, 1992, 36 (02): : 267 - 287
  • [2] [Anonymous], 2000, P 2000 INT C ART INT
  • [3] SMOTE: Synthetic minority over-sampling technique
    Chawla, Nitesh V.
    Bowyer, Kevin W.
    Hall, Lawrence O.
    Kegelmeyer, W. Philip
    [J]. 2002, American Association for Artificial Intelligence (16)
  • [4] Cortes C., 1995, THESIS U ROCHESTER
  • [5] CRISTIANINI N, 2002, J MACHINE LEARNING R, V1
  • [6] Cristianini N., 2000, Intelligent Data Analysis: An Introduction, DOI 10.1017/CBO9780511801389
  • [7] Joachims Thorsten, 1998, P ECML 98 10 EUR C M, P137
  • [8] Kubat M., 1997, ICML
  • [9] KUBAT M, 1997, P ECML 97 9 EUR C MA
  • [10] LING C, 1998, P 4 INT C KNOWL DIS