Is cross-validation valid for small-sample microarray classification?

Cited by: 433
Authors
Braga-Neto, UM
Dougherty, ER
Affiliations
[1] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77840 USA
[2] Univ Texas, MD Anderson Canc Ctr, Sect Clin Canc Genet, Houston, TX 77030 USA
[3] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
Keywords
DOI
10.1093/bioinformatics/btg419
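Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry];
Abstract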
Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.
Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).
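The deviation-distribution comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or experimental setup. It assumes a simple nearest-mean classifier as a stand-in for the paper's rules, two synthetic Gaussian classes, and a large hold-out sample to approximate the true error. All function names and parameter values are hypothetical. Bootstrap estimation is omitted for brevity.

```python
import numpy as np

def fit_nearest_mean(X, y):
    """Class means for a nearest-mean rule (assumed stand-in, not the paper's LDA/3NN/CART)."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def predict(means, X):
    m0, m1 = means
    d0 = ((X - m0) ** 2).sum(axis=1)
    d1 = ((X - m1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)  # assign to the closer class mean

def error_rate(means, X, y):
    return float(np.mean(predict(means, X) != y))

def resubstitution(X, y):
    """Error measured on the same sample used for training."""
    return error_rate(fit_nearest_mean(X, y), X, y)

def leave_one_out(X, y):
    """Leave-one-out cross-validation error estimate."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        means = fit_nearest_mean(X[keep], y[keep])
        mistakes += int(predict(means, X[i:i + 1])[0] != y[i])
    return mistakes / n

def sample(rng, n_per_class, dim, shift):
    """Two unit-variance Gaussian classes whose means differ by `shift` per feature."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    X1 = rng.normal(shift, 1.0, size=(n_per_class, dim))
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

def deviation_stats(n_per_class=10, dim=2, shift=1.0, trials=200, seed=0):
    """Bias, variance and RMS of (estimated error - true error) over repeated small samples."""
    rng = np.random.default_rng(seed)
    # A large independent sample approximates the true error of each trained classifier.
    X_test, y_test = sample(rng, 2000, dim, shift)
    devs = {"resub": [], "loo": []}
    for _ in range(trials):
        X, y = sample(rng, n_per_class, dim, shift)
        true_err = error_rate(fit_nearest_mean(X, y), X_test, y_test)
        devs["resub"].append(resubstitution(X, y) - true_err)
        devs["loo"].append(leave_one_out(X, y) - true_err)
    stats = {}
    for name, values in devs.items():
        d = np.asarray(values)
        stats[name] = {
            "bias": float(d.mean()),
            "variance": float(d.var()),
            "rms": float(np.sqrt(np.mean(d ** 2))),
        }
    return stats

if __name__ == "__main__":
    for name, s in deviation_stats().items():
        print(name, s)
```

In this sketch the RMS value combines the other two statistics, since the mean squared deviation equals the squared bias plus the variance. That matches the abstract's description of root-mean-square error as a composition of bias and variance.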
Pages: 374-380
Number of pages: 7
Related papers
13 in total
[1]
Genomic data sampling and its effect on classification performance assessment [J].
Azuaje, F .
BMC BIOINFORMATICS, 2003, 4 (1)
[2]
Chernick MR., 1999, Bootstrap methods: a practitioner's guide
[4]
Devroye L., 1996, A probabilistic theory of pattern recognition
[5]
Small sample issues for microarray-based classification [J].
Dougherty, ER .
COMPARATIVE AND FUNCTIONAL GENOMICS, 2001, 2 (01) :28-34
[8]
Friedman J., 2001, The elements of statistical learning, V1, DOI 10.1007/978-0-387-21606-5
[9]
Kohavi R., 1995, INT JOINT C ARTIFICI, DOI 10.5555/1643031.1643047
[10]
A gene-expression signature as a predictor of survival in breast cancer [J].
van de Vijver, MJ; He, YD; van 't Veer, LJ; Dai, H; Hart, AAM; Voskuil, DW; Schreiber, GJ; Peterse, JL; Roberts, C; Marton, MJ; Parrish, M; Atsma, D; Witteveen, A; Glas, A; Delahaye, L; van der Velde, T; Bartelink, H; Rodenhuis, S; Rutgers, ET; Friend, SH; Bernards, R .
NEW ENGLAND JOURNAL OF MEDICINE, 2002, 347 (25) :1999-2009