Performance of Error Estimators for Classification

Cited by: 40
Authors
Dougherty, Edward R. [1 ,2 ,3 ]
Sima, Chao [2 ]
Hua, Jianping [2 ]
Hanczar, Blaise [4 ]
Braga-Neto, Ulisses M. [1 ]
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
[2] Translat Genom Res Inst, Computat Biol Div, Phoenix, AZ USA
[3] Univ Texas MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
[4] Univ Paris 05, LIPADE, Paris, France
Funding
U.S. National Science Foundation;
Keywords
Classification; epistemology; error estimation; validity; gene-expression signature; cross-validation; prediction; microarray; cancer; metastasis; survival; profile
DOI
10.2174/157489310790596385
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Classification in bioinformatics often suffers from small samples in conjunction with large numbers of features, which makes error estimation problematic. When a sample is small, there are insufficient data to split the sample, so the same data are used for both classifier design and error estimation. Error estimation can suffer from high variance, bias, or both. The problem of choosing a suitable error estimator is exacerbated by the fact that estimation performance depends on the rule used to design the classifier, the feature-label distribution to which the classifier is to be applied, and the sample size. This paper reviews the performance of training-sample error estimators with respect to several criteria: estimation accuracy, variance, bias, correlation with the true error, regression on the true error, and accuracy in ranking feature sets. A number of error estimators are considered: resubstitution, leave-one-out cross-validation, 10-fold cross-validation, bolstered resubstitution, semi-bolstered resubstitution, .632 bootstrap, .632+ bootstrap, and optimal bootstrap. The paper illustrates these performance criteria for certain models and for two real data sets, referring to the literature for more extensive applications of these criteria. The results given in the present paper are consistent with those in the literature and lead to two conclusions: (1) much greater effort needs to be focused on error estimation, and (2) owing to the generally poor performance of error estimators on small samples, for a conclusion based on a small-sample error estimator to be considered valid, it should be supported by evidence that the estimator in question can be expected to perform sufficiently well under the circumstances to justify the conclusion.
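To make three of the estimators named in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes a one-dimensional nearest-mean classifier and a synthetic two-class Gaussian model; the function names, the class means (0.0 and 1.5), the sample size, and the number of bootstrap replicates are all illustrative choices. The .632 bootstrap is written in its common form, 0.368 times the resubstitution error plus 0.632 times the zero-bootstrap (out-of-bag) error.

```python
import random

def train_nearest_mean(sample):
    """Fit a 1-D nearest-mean classifier; returns a function x -> label."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        raise ValueError("both classes must be present in the sample")
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def error_rate(classify, data):
    """Fraction of points in data that the classifier mislabels."""
    return sum(classify(x) != y for x, y in data) / len(data)

def resubstitution(sample):
    """Error of the designed classifier on its own training sample."""
    return error_rate(train_nearest_mean(sample), sample)

def leave_one_out(sample):
    """Hold out each point once, train on the rest, test on the held-out point."""
    errors = 0
    for i, (x, y) in enumerate(sample):
        classify = train_nearest_mean(sample[:i] + sample[i + 1:])
        errors += classify(x) != y
    return errors / len(sample)

def bootstrap_632(sample, n_boot=200, rng=None):
    """0.368 * resubstitution + 0.632 * zero-bootstrap (out-of-bag) error."""
    rng = rng or random.Random(0)
    n = len(sample)
    errors = tested = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        boot = [sample[i] for i in idx]
        held_out = [sample[i] for i in range(n) if i not in chosen]
        # Skip replicates that cannot be trained or tested (assumption of this sketch).
        if not held_out or len({y for _, y in boot}) < 2:
            continue
        classify = train_nearest_mean(boot)
        errors += sum(classify(x) != y for x, y in held_out)
        tested += len(held_out)
    if tested == 0:
        raise ValueError("no usable bootstrap replicates")
    return 0.368 * resubstitution(sample) + 0.632 * errors / tested

def draw_sample(n_per_class, rng):
    """Synthetic model (an assumption): unit-variance Gaussians at 0.0 and 1.5."""
    return ([(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)] +
            [(rng.gauss(1.5, 1.0), 1) for _ in range(n_per_class)])

rng = random.Random(42)
sample = draw_sample(10, rng)        # small training sample
test_set = draw_sample(5000, rng)    # large independent sample approximating the true error
true_error = error_rate(train_nearest_mean(sample), test_set)
estimates = {
    "resubstitution": resubstitution(sample),
    "leave-one-out": leave_one_out(sample),
    ".632 bootstrap": bootstrap_632(sample, rng=random.Random(1)),
}
print(f"true error (approx.): {true_error:.3f}")
for name, value in estimates.items():
    print(f"{name}: {value:.3f}  deviation: {value - true_error:+.3f}")
```

Repeating the last block over many independently drawn samples and collecting the deviations would give empirical versions of the bias, variance, and correlation criteria listed in the abstract, for this toy model only.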
Pages: 53-67
Page count: 15