A classification procedure for highly imbalanced class sizes

Cited by: 28
Authors
Byon, Eunshin [1]
Shrivastava, Abhishek K. [2]
Ding, Yu [1]
Affiliations
[1] Texas A&M Univ, Dept Ind & Syst Engn, College Stn, TX 77843 USA
[2] City Univ Hong Kong, Dept Mfg Engn & Engn Management, Kowloon, Hong Kong, Peoples R China
Funding
National Science Foundation (USA)
Keywords
Data reduction; detection power; ensemble classifier; false alarm rate; highly imbalanced classification; resampling; support vector machine; NOVELTY DETECTION; PRODUCT;
DOI
10.1080/07408170903228967
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
This article develops an effective procedure for handling two-class classification problems with highly imbalanced class sizes. In many imbalanced two-class problems, the majority class represents "normal" cases, while the minority class represents "abnormal" cases, whose detection is critical to decision making. When the class sizes are highly imbalanced, conventional classification methods tend to strongly favor the majority class, resulting in very low or even no detection of the minority class. The research objective of this article is to devise a systematic procedure that substantially improves the power of detecting the minority class, so that the procedure can be used to screen the original data set and select a much smaller subset for further investigation. The procedure is based on ensemble classifiers, where each classifier is constructed from a resized training set with a reduced dimension space. The article also specifies how to find the best values of the decision variables in the proposed classification procedure. The proposed method is compared to a set of off-the-shelf classification methods using two real data sets. Its prediction results show remarkable improvements over the other methods: the proposed method can detect about 75% of the minority class units, while the other methods yield much lower detection rates.
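The general idea in the abstract (an ensemble in which each member is trained on a resized training set in a reduced dimension space) can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes that "resizing" means randomly undersampling the majority class, that "reduced dimension" means a random subset of features, that the base learner is a simple nearest-centroid rule, and that members are combined by majority vote. The abstract specifies none of these details. All function names, parameters, and the toy data are hypothetical.

```python
import random

def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_member(majority, minority, n_features, ratio, rng):
    """Train one ensemble member (assumed design, not the paper's).

    The majority class is undersampled to about `ratio` times the minority
    size, and a random subset of `n_features` feature indices is kept.
    The member is a nearest-centroid rule on the kept features.
    """
    dims = rng.sample(range(len(minority[0])), n_features)
    k = min(len(majority), max(1, int(ratio * len(minority))))
    maj_sample = rng.sample(majority, k)
    c_maj = centroid([[r[d] for d in dims] for r in maj_sample])
    c_min = centroid([[r[d] for d in dims] for r in minority])
    return dims, c_maj, c_min

def predict(ensemble, row, vote_threshold=0.5):
    """Return 1 (minority) if at least `vote_threshold` of members vote for it."""
    votes = 0
    for dims, c_maj, c_min in ensemble:
        x = [row[d] for d in dims]
        if sq_dist(x, c_min) < sq_dist(x, c_maj):
            votes += 1
    return 1 if votes / len(ensemble) >= vote_threshold else 0

def detection_and_false_alarm(ensemble, majority, minority):
    """Detection power (minority flagged) and false alarm rate (majority flagged)."""
    detected = sum(predict(ensemble, r) for r in minority)
    false_alarms = sum(predict(ensemble, r) for r in majority)
    return detected / len(minority), false_alarms / len(majority)

# Synthetic toy data (hypothetical): 500 "normal" units, 10 "abnormal" units.
rng = random.Random(0)
majority = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(500)]
minority = [[rng.gauss(3, 1) for _ in range(6)] for _ in range(10)]

ensemble = [train_member(majority, minority, n_features=3, ratio=2.0, rng=rng)
            for _ in range(15)]
power, false_alarm = detection_and_false_alarm(ensemble, majority, minority)
print("detection power:", power, "false alarm rate:", false_alarm)
```

The two rates are computed on the training data only to keep the sketch short; a real evaluation would use held-out data. In this sketch, `ratio`, `n_features`, the ensemble size, and `vote_threshold` play the role of the decision variables whose best values the article says it specifies how to find.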
Pages: 288-303
Number of pages: 16