Performance of feature-selection methods in the classification of high-dimension data

Cited by: 239
Authors
Hua, Jianping [1 ]
Tembe, Waibhav D. [2 ]
Dougherty, Edward R. [1 ,3 ]
Affiliations
[1] Translat Genom Res Inst, Computat Biol Div, Phoenix, AZ 85004 USA
[2] Translat Genom Res Inst, High Performance Biocomp Div, Phoenix, AZ 85004 USA
[3] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
Funding
US National Science Foundation
Keywords
Classification; Feature selection; Microarray; MOLECULAR CLASSIFICATION; OPTIMAL NUMBER; GENE; PREDICTION; DISCOVERY;
DOI
10.1016/j.patcog.2008.08.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being commonplace. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies of feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performance of feature-selection algorithms for different distribution models and classifiers. Both the classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods exhibit similar trends. Filter methods such as the t-test perform better than or comparably to wrapper methods on harder problems, although this improved performance is usually accompanied by significant peaking. Wrapper methods perform better when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, performs worse than the univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts. (C) 2008 Elsevier Ltd. All rights reserved.
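The paper does not publish code; the following is a minimal illustrative sketch (not the authors' implementation) of the kind of experiment the abstract describes: a synthetic two-class Gaussian model in which a few marker features carry a mean shift among thousands of noise features, a univariate t-test filter compared against a t-test-prefiltered forward-selection wrapper, and, for each method, the cross-validated classification error and the number of true markers recovered. The use of scikit-learn, LDA as the classifier, and all parameter values are assumptions made for this sketch.

```python
# Illustrative sketch only (not the authors' code). Assumes numpy, scipy,
# and scikit-learn are available; classifier and parameters are arbitrary.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_class, n_features, n_markers, n_select = 30, 2000, 20, 5

# Two-class Gaussian model: the first n_markers features carry a class-1
# mean shift (markers); all remaining features are pure noise (non-markers).
X0 = rng.normal(0.0, 1.0, (n_per_class, n_features))
X1 = rng.normal(0.0, 1.0, (n_per_class, n_features))
X1[:, :n_markers] += 1.0
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Univariate filter: rank features by the two-sample t-statistic, keep the top n_select.
t_stats, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
ranking = np.argsort(-np.abs(t_stats))
filter_idx = ranking[:n_select]

# t-test-based wrapper: greedy forward selection within a t-test-prefiltered
# pool, scored by the cross-validated accuracy of the classifier itself.
def forward_select(X, y, k):
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_acc, best_f = -1.0, None
        for f in remaining:
            acc = cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, chosen + [f]], y, cv=5).mean()
            if acc > best_acc:
                best_acc, best_f = acc, f
        chosen.append(best_f)
        remaining.remove(best_f)
    return chosen

pool = ranking[:50]
wrapper_idx = [pool[i] for i in forward_select(X[:, pool], y, n_select)]

# Report the two quantities the study tracks: classification error and
# the number of true markers among the selected features.
for name, idx in [("t-test filter", filter_idx), ("t-test wrapper", wrapper_idx)]:
    err = 1.0 - cross_val_score(LinearDiscriminantAnalysis(),
                                X[:, list(idx)], y, cv=5).mean()
    found = sum(int(i) < n_markers for i in idx)
    print(f"{name}: CV error ~ {err:.3f}, markers found {found}/{n_select}")
```

Increasing n_per_class in this sketch tends to favor the wrapper over the filter, which is the kind of sample-size trend the abstract reports; with the small sample above, the filter is often comparable or better.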
Pages: 409-424
Page count: 16