Class prediction for high-dimensional class-imbalanced data

Cited by: 172
Authors
Blagus, Rok [1]
Lusa, Lara [1]
Affiliation
[1] Univ Ljubljana, Inst Biostat & Med Informat, Ljubljana, Slovenia
Source
BMC BIOINFORMATICS | 2010 / Vol. 11
Keywords
GENE-EXPRESSION DATA; OLIGONUCLEOTIDE MICROARRAY; CLASSIFICATION; VALIDATION; SELECTION; PROFILES; GENOME;
DOI
10.1186/1471-2105-11-523
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance.
Results: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers.
Conclusions: Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
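The caution about assessing predictive accuracy can be illustrated with a small sketch. It is not the authors' code: the function names, the toy labels, and the 90/10 class split are assumptions made for illustration, and "down-sizing" is taken here to mean randomly discarding majority-class samples until the classes are equal in size. The example shows how a rule that always predicts the majority class reaches high overall accuracy while its class-specific accuracy for the minority class is zero.

```python
import random
from collections import Counter


def class_specific_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in totals}


def downsize(labels, seed=0):
    """Return sorted sample indices with every class cut to the smallest class size.

    Assumed meaning of "down-sizing": random undersampling of larger classes.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


# Hypothetical toy labels: 90 majority-class and 10 minority-class samples.
y_true = ["maj"] * 90 + ["min"] * 10
# A degenerate rule that assigns every sample to the majority class.
y_pred = ["maj"] * 100

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = class_specific_accuracy(y_true, y_pred)

# Indices of a class-balanced subset (10 samples per class in this toy case).
balanced_idx = downsize(y_true)
balanced_counts = Counter(y_true[i] for i in balanced_idx)

print(overall, per_class, dict(balanced_counts))
```

In an actual analysis the down-sizing would presumably be applied to the training set only, with variable selection repeated on each down-sized set, so that the test set is left untouched.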
Pages: 17