Is cross-validation valid for small-sample microarray classification?

Cited by: 433

Authors:

Braga-Neto, UM

Dougherty, ER

Affiliations:

[1] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77840 USA

[2] Univ Texas, MD Anderson Canc Ctr, Sect Clin Canc Genet, Houston, TX 77030 USA

[3] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA

Source:

BIOINFORMATICS | 2004 / Vol. 20 / Issue 3


DOI:

10.1093/bioinformatics/btg419

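Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry];

Abstract: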

Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).
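The comparison described in the abstract (the distribution of estimated error minus true error, summarized by its mean, variance and root-mean-square) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's setup: the data are two synthetic one-dimensional unit-variance Gaussian classes, the classifier is a simple nearest-mean rule standing in for linear discriminant analysis, cross-validation is leave-one-out, and bootstrap estimation is omitted. The function names and default parameters (`n_per_class`, `delta`, `reps`, `seed`) are hypothetical.

```python
import math
import random
import statistics


def draw_sample(rng, n_per_class, delta):
    """Draw n_per_class points from each of two unit-variance 1-D Gaussians."""
    xs = [rng.gauss(-delta / 2, 1.0) for _ in range(n_per_class)]
    xs += [rng.gauss(delta / 2, 1.0) for _ in range(n_per_class)]
    ys = [0] * n_per_class + [1] * n_per_class
    return xs, ys


def train(xs, ys):
    """Nearest-mean rule (illustrative stand-in): the model is the two class means."""
    m0 = statistics.fmean([x for x, y in zip(xs, ys) if y == 0])
    m1 = statistics.fmean([x for x, y in zip(xs, ys) if y == 1])
    return m0, m1


def predict(model, x):
    m0, m1 = model
    return 0 if abs(x - m0) <= abs(x - m1) else 1


def resubstitution_error(xs, ys):
    """Error of the classifier on the same sample it was trained on."""
    model = train(xs, ys)
    return sum(predict(model, x) != y for x, y in zip(xs, ys)) / len(xs)


def loo_cv_error(xs, ys):
    """Leave-one-out cross-validation: retrain without point i, then test on it."""
    errors = 0
    for i in range(len(xs)):
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors += predict(model, xs[i]) != ys[i]
    return errors / len(xs)


def true_error(model, delta):
    """Exact error of the trained rule under the assumed Gaussian model."""
    m0, m1 = model
    t = (m0 + m1) / 2  # decision threshold between the two estimated means

    def phi(z):  # standard normal CDF
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    e = 0.5 * (1 - phi(t + delta / 2)) + 0.5 * phi(t - delta / 2)
    return e if m0 <= m1 else 1 - e  # labels are flipped if the means cross


def simulate(n_per_class=10, delta=1.5, reps=200, seed=0):
    """Return {estimator: (bias, variance, rms)} of (estimated - true) error."""
    rng = random.Random(seed)
    deviations = {"resubstitution": [], "leave-one-out": []}
    for _ in range(reps):
        xs, ys = draw_sample(rng, n_per_class, delta)
        truth = true_error(train(xs, ys), delta)
        deviations["resubstitution"].append(resubstitution_error(xs, ys) - truth)
        deviations["leave-one-out"].append(loo_cv_error(xs, ys) - truth)
    summary = {}
    for name, devs in deviations.items():
        bias = statistics.fmean(devs)
        var = statistics.pvariance(devs)
        rms = math.sqrt(statistics.fmean([d * d for d in devs]))
        summary[name] = (bias, var, rms)
    return summary


if __name__ == "__main__":
    for name, (bias, var, rms) in simulate().items():
        print(f"{name}: bias={bias:+.4f} variance={var:.4f} rms={rms:.4f}")
```

Because the class distributions are known in this toy model, the true error of each trained classifier is computed exactly rather than estimated on held-out data. The three summary numbers are tied together by rms² = bias² + variance, which is the sense in which the root-mean-square error combines bias and variance.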


Pages: 374-380

Page count: 7
