Variance reduction in estimating classification error using sparse datasets

Cited by: 47
Authors
Beleites, C
Baumgartner, R
Bowman, C
Somorjai, R
Steiner, G
Salzer, R
Sowa, MG
Affiliations
[1] Natl Res Council Canada, Inst Biodiagnost, Winnipeg, MB R3B 1Y6, Canada
[2] Tech Univ Dresden, D-01062 Dresden, Germany
Keywords
error rate estimation; crossvalidation; bootstrap resampling; small sample size
DOI
10.1016/j.chemolab.2005.04.008
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In biomedical applications, frequently only a limited number of samples are available for the development and testing of classification rules. Understanding the behavior of the error estimators in this setting is therefore highly desirable. In an extensive study using simulated as well as real-life data we investigated the properties of commonly used error estimators in terms of their bias and variance, and have found that in these small-sample size situations, the influence of variance on the error estimates can be significant, and can dominate the bias. Consequently, our results strongly suggest that bootstrap resampling and/or k-fold crossvalidation-based estimators, especially when computed over multiple data splits, should be preferred in these small-sample size scenarios, because of their reduced variance compared to the more routinely used crossvalidation approaches. While linear partial least squares was used as the classifier/regressor, the general conclusions arising from this study are not qualitatively affected for other classifiers, linear or nonlinear. (c) 2005 Elsevier B.V. All rights reserved.
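The effect of averaging over multiple data splits can be illustrated with a small simulation. The following is a minimal sketch and not the authors' procedure: it assumes two simulated 1-D Gaussian classes and a nearest-class-mean rule as a stand-in for the partial least squares classifier, and all function names (`make_data`, `count_errors`, `kfold_error`, `repeated_kfold_error`, `oob_bootstrap_error`, `split_variance_demo`) are hypothetical. It compares how much a single 5-fold estimate and a 20-times-repeated 5-fold estimate fluctuate across random splits of one fixed small dataset, and also computes an out-of-bag bootstrap estimate.

```python
import random
import statistics


def make_data(n_per_class, shift, rng):
    """Simulate two 1-D Gaussian classes whose means differ by `shift`."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(shift, 1.0), 1) for _ in range(n_per_class)]
    return data


def count_errors(train, test):
    """Fit a nearest-class-mean rule on `train`; count mistakes on `test`.

    Returns None when a class is absent from `train` (rule cannot be fitted).
    This simple rule is an assumption standing in for the paper's PLS classifier.
    """
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        if not xs:
            return None
        means[label] = sum(xs) / len(xs)
    return sum(min(means, key=lambda c: abs(x - means[c])) != y for x, y in test)


def kfold_error(data, k, rng):
    """Error estimate from one random k-fold split, pooled over the folds."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    errors = tested = 0
    for fold in range(k):
        held_out = set(idx[fold::k])
        train = [data[i] for i in idx if i not in held_out]
        test = [data[i] for i in held_out]
        e = count_errors(train, test)
        if e is not None:  # skip folds whose training part lacks a class
            errors += e
            tested += len(test)
    return errors / tested


def repeated_kfold_error(data, k, repeats, rng):
    """Average of `repeats` k-fold estimates, each on a fresh random split."""
    return statistics.mean([kfold_error(data, k, rng) for _ in range(repeats)])


def oob_bootstrap_error(data, n_boot, rng):
    """Out-of-bag bootstrap: train on a resample, test on the left-out samples."""
    n = len(data)
    errors = tested = 0
    for _ in range(n_boot):
        drawn = [rng.randrange(n) for _ in range(n)]
        in_bag = set(drawn)
        train = [data[i] for i in drawn]
        test = [data[i] for i in range(n) if i not in in_bag]
        e = count_errors(train, test)
        if e is not None:  # skip resamples that miss a whole class
            errors += e
            tested += len(test)
    return errors / tested


def split_variance_demo(seed=0, trials=50):
    """Variance across random splits of ONE fixed dataset (20 samples)."""
    rng = random.Random(seed)
    data = make_data(n_per_class=10, shift=1.0, rng=rng)
    single = [kfold_error(data, 5, rng) for _ in range(trials)]
    repeated = [repeated_kfold_error(data, 5, 20, rng) for _ in range(trials)]
    return {
        "var_single": statistics.pvariance(single),
        "var_repeated": statistics.pvariance(repeated),
        "oob": oob_bootstrap_error(data, 100, rng),
    }


if __name__ == "__main__":
    print(split_variance_demo())
```

This sketch isolates only the variance caused by the random choice of split. The variance that comes from drawing a different small dataset, which the study also considers, is not reduced by repeating the splits.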
Pages: 91-100
Number of pages: 10