SMOTE for high-dimensional class-imbalanced data

Cited by: 819
Authors
Blagus, Rok [1]
Lusa, Lara [1]
Affiliations
[1] Univ Ljubljana, Inst Biostat & Med Informat, Ljubljana, Slovenia
Keywords
DATA SETS; CLASSIFICATION; PREDICTION; DISCRIMINATION; SIGNATURE
DOI
10.1186/1471-2105-14-106
Chinese Library Classification (CLC)
Q5 [Biochemistry]
Discipline code
070307 [Chemical Biology]
Abstract
Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling Technique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.

Results: While in most cases SMOTE seems beneficial for low-dimensional data, it does not attenuate the bias towards classification into the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, otherwise, k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how these findings affect class prediction for high-dimensional data.

Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before SMOTE is applied; the benefit is larger when more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
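To make the mechanism behind these findings concrete: SMOTE creates each synthetic minority sample by interpolating, with a random weight, between an existing minority sample and one of its k nearest minority-class neighbors. The following is a minimal Python/NumPy sketch of that idea, not the authors' code; the function name smote_sketch, its parameters, and the simulated data are illustrative assumptions. It also checks empirically the abstract's claim that SMOTE leaves class-specific means unchanged while shrinking variability.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Illustrative SMOTE-style oversampling: each synthetic sample
    interpolates between a minority sample and one of its k nearest
    minority-class neighbors (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    seeds = rng.integers(0, n, size=n_new)             # seed sample per synthetic point
    picks = nn[seeds, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.uniform(0.0, 1.0, size=(n_new, 1))       # interpolation weight in [0, 1)
    return X_min[seeds] + gap * (X_min[picks] - X_min[seeds])

# Simulated high-dimensional minority class: 20 samples, 1000 variables.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 1000))
X_syn = smote_sketch(X_min, n_new=200, k=5, seed=1)

# Means are preserved (difference near 0), variability shrinks (< 1).
print(np.abs(X_syn.mean(axis=0) - X_min.mean(axis=0)).mean())
print(X_min.var(axis=0).mean(), X_syn.var(axis=0).mean())
```

Because every synthetic point is a convex combination of two existing minority samples, synthetic points sharing an endpoint are correlated with each other and with the originals, and their per-variable variance is roughly E[(1-u)^2 + u^2] = 2/3 of the original when the endpoints are nearly independent, which is what the variance printout illustrates and what the abstract summarizes as decreased variability with unchanged means.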
Pages: 16