Class-imbalanced classifiers for high-dimensional data

Cited by: 262

Authors:

Lin, Wei-Jiun [1,2]

Chen, James J. [2,3,4]

Affiliations:

[1] Feng Chia Univ, Dept Appl Math, Taichung, Taiwan

[2] US FDA, Natl Ctr Toxicol Res, Rockville, MD 20857 USA

[3] China Med Univ, Grad Inst Biostat, Taichung, Taiwan

[4] China Med Univ, Ctr Biostat, Taichung, Taiwan

Source:

BRIEFINGS IN BIOINFORMATICS | 2013 / Vol. 14 / Issue 01

Keywords:

class-imbalanced prediction; feature selection; lack of data; performance metrics; threshold adjustment; under-sampling ensemble; CANCER CLASSIFICATION; PREDICTION; MICROARRAY; PROFILES;

DOI:

10.1093/bib/bbs006

Chinese Library Classification (CLC) code:

Q5 [Biochemistry];

Discipline classification code:

070307 [Chemical Biology];

Abstract:

A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class, resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some of the most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered: three standard classification algorithms, each coupled with an ensemble correction strategy, and one support vector machine (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (iii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte-Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. DLDA with feature selection can perform well without using the ensemble correction.
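
The under-sampling ensemble idea named in the abstract and keywords can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0/1 label convention (1 = minority class), the equal-prior DLDA rule, the ensemble size of 25 and the majority-vote aggregation are all assumptions made for illustration. Each ensemble member is trained on all minority samples plus an equally sized random draw from the majority class, so no single member sees an imbalanced training set.

```python
import numpy as np

def fit_dlda(X, y):
    """Fit diagonal LDA: per-class means and a pooled per-feature variance.

    Assumes y is an integer array with labels 0 and 1 (illustrative convention).
    """
    means = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    resid = X - means[y]                      # deviation from own class mean
    var = (resid ** 2).sum(axis=0) / (len(y) - 2)
    return means, var + 1e-8                  # small constant avoids division by zero

def dlda_score(model, X):
    """Return a score that is positive when a sample is closer to class 1."""
    means, var = model
    d0 = ((X - means[0]) ** 2 / var).sum(axis=1)
    d1 = ((X - means[1]) ** 2 / var).sum(axis=1)
    return d0 - d1

def undersampling_ensemble(X, y, n_members=25, seed=0):
    """Train n_members DLDA models, each on a balanced random subsample."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_members):
        # Draw as many majority samples as there are minority samples.
        sub = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sub])
        models.append(fit_dlda(X[idx], y[idx]))
    return models

def predict_ensemble(models, X):
    """Aggregate member predictions by majority vote (one assumed choice)."""
    votes = np.stack([dlda_score(m, X) > 0 for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

Calling `undersampling_ensemble(X_train, y_train)` followed by `predict_ensemble(models, X_new)` returns 0/1 class labels. Using an odd number of members avoids tied votes.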

Citation

Pages: 13-26

Page count: 14

49 references in total

[1]

Applying support vector machines to imbalanced datasets [J].