An approach for classification of highly imbalanced data using weighting and undersampling

Times cited: 137
Authors
Anand, Ashish [1]
Pugalenthi, Ganesan [1]
Fogel, Gary B. [2]
Suganthan, P. N. [1]
Affiliations
[1] Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore 639798, Singapore
[2] Natural Selection, Inc., San Diego, CA 92121, USA
Keywords
Imbalanced datasets; SVM; Undersampling technique; PROTEIN; PREDICTION; RESIDUES; SEQUENCE; SITES; IDENTIFICATION; CLASSIFIERS
DOI
10.1007/s00726-010-0595-2
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Real-world datasets are commonly imbalanced. Several approaches exist for handling such data, including weighting, sub-sampling, and data modeling. Learning in the presence of data imbalance presents a great challenge to machine learning. Techniques such as support-vector machines perform excellently on balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real, imbalanced biological datasets, in which the ratio of negative to positive samples varies from approximately 9:1 to approximately 100:1. Useful classifiers have both high sensitivity and high specificity. Our results demonstrate that the proposed selection technique improves sensitivity compared to the weighted support-vector machine and to results available in the literature for the same datasets.
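To make the evaluation terms in the abstract concrete, the following is a minimal Python sketch of how sensitivity and specificity are computed, and of plain random undersampling of the majority class. It is not the selection technique proposed in the paper, which this record does not describe; the function names `sensitivity_specificity` and `random_undersample`, the `ratio` and `seed` parameters, and the toy 9:1 dataset are all assumptions made for illustration.

```python
import random


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p != 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p != 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep every positive sample and a random subset of the negatives.

    `ratio` is the number of negatives kept per positive (hypothetical
    parameter). This is generic random undersampling, used here only as
    a stand-in for the paper's own selection rule.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label != 1]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy dataset with a 9:1 negative-to-positive ratio (illustrative only).
y = [1] * 10 + [0] * 90
X = list(range(len(y)))

# A classifier that always predicts the majority class is 90% accurate
# here, yet it never detects a positive sample.
always_negative = [0] * len(y)
print(sensitivity_specificity(y, always_negative))

# Balance the training set before fitting a classifier.
X_balanced, y_balanced = random_undersample(X, y, ratio=1.0)
print(len(y_balanced), sum(y_balanced))
```

The always-negative baseline shows why accuracy alone is misleading on imbalanced data and why the abstract asks for both high sensitivity and high specificity.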
Pages: 1385-1391
Page count: 7