Class-imbalanced classifiers for high-dimensional data

Cited by: 262
Authors
Lin, Wei-Jiun [1, 2]
Chen, James J. [2, 3, 4]
Affiliations
[1] Feng Chia Univ, Dept Appl Math, Taichung, Taiwan
[2] US FDA, Natl Ctr Toxicol Res, Rockville, MD 20857 USA
[3] China Med Univ, Grad Inst Biostat, Taichung, Taiwan
[4] China Med Univ, Ctr Biostat, Taichung, Taiwan
Keywords
class-imbalanced prediction; feature selection; lack of data; performance metrics; threshold adjustment; under-sampling ensemble; CANCER CLASSIFICATION; PREDICTION; MICROARRAY; PROFILES
DOI
10.1093/bib/bbs006
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
070307 [Chemical Biology]
Abstract
A class-imbalanced classifier is a decision rule that predicts the class membership of new samples from an available data set in which the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in predicting the minority class. A class-imbalanced classifier typically modifies a standard classifier, either by applying a correction strategy or by incorporating a new strategy in the training phase, to account for the differing class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data, and feature selection. Four class-imbalanced classifiers are considered: three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform best when the class imbalance is not too severe. SVM-THR performs well if the imbalance is severe and the predictors are highly correlated. DLDA with feature selection can perform well without the ensemble correction.
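As a rough illustration of the under-sampling ensemble idea named in the abstract and keywords, the sketch below pairs a simple DLDA base classifier with repeated random under-sampling of the majority class. It is not the authors' implementation. The function names, the number of ensemble members, the majority-vote aggregation, the equal class priors, and the restriction to two classes are all assumptions made for this example.

```python
import numpy as np


def fit_dlda(X, y):
    """Fit diagonal LDA: class means plus one pooled variance per feature."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Within-class residuals, pooled over classes (diagonal covariance only).
    resid = np.concatenate([X[y == c] - means[i] for i, c in enumerate(classes)])
    var = (resid ** 2).sum(axis=0) / (len(X) - len(classes)) + 1e-8
    return classes, means, var


def predict_dlda(model, X):
    """Assign each row to the class with the nearest variance-scaled mean.

    Equal class priors are assumed, which suits balanced training subsets.
    """
    classes, means, var = model
    dist = (((X[:, None, :] - means[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def undersample_ensemble_predict(X_train, y_train, X_new, n_members=25, seed=0):
    """Majority vote over DLDA models trained on balanced random subsets.

    Assumes exactly two classes; n_members should be odd to avoid tied votes.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y_train, return_counts=True)
    min_pos = counts.argmin()
    minority, majority = classes[min_pos], classes[1 - min_pos]
    idx_min = np.flatnonzero(y_train == minority)
    idx_maj = np.flatnonzero(y_train == majority)

    votes = np.zeros(len(X_new))
    for _ in range(n_members):
        # Keep every minority sample; draw an equally sized majority subset.
        sub = rng.choice(idx_maj, size=len(idx_min), replace=False)
        idx = np.concatenate([idx_min, sub])
        model = fit_dlda(X_train[idx], y_train[idx])
        votes += predict_dlda(model, X_new) == minority
    return np.where(votes > n_members / 2, minority, majority)
```

Each ensemble member sees all minority samples but only a random slice of the majority class, so no single model is dominated by the larger class, while the vote across members still uses most of the majority data.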
Pages: 13-26
Page count: 14