Exact performance of error estimators for discrete classifiers

Cited by: 45
Authors
Braga-Neto, U
Dougherty, E [1]
Affiliations
[1] Fiocruz MS, Aggeu Magalhaes Res Ctr, Virol & Expt Therapy Lab, CPqAM, BR-50670420 Recife, PE, Brazil
[2] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77843 USA
[3] Translat Genom Res Inst, Div Computat Biol, Phoenix, AZ 85004 USA
[4] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
Keywords
error estimation; discrete classification; histogram rule; resubstitution; leave-one-out; cross-validation; bootstrap
DOI
10.1016/j.patcog.2005.02.013
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Discrete classification problems abound in pattern recognition and data mining applications. One of the most common discrete rules is the discrete histogram rule. This paper presents exact formulas for the computation of bias, variance, and RMS of the resubstitution and leave-one-out error estimators for the discrete histogram rule. We also describe an algorithm to compute the exact probability distribution of resubstitution and leave-one-out, as well as their deviations from the true error rate. Using a parametric Zipf model, we compute the exact performance of resubstitution and leave-one-out for varying expected true error, number of samples, and classifier complexity (number of bins). We compare this to approximate performance measures, computed by Monte Carlo sampling, of 10-repeated 4-fold cross-validation and the 0.632 bootstrap error estimator. Our results show that resubstitution is low-biased but much less variable than leave-one-out, and is effectively the superior error estimator between the two, provided classifier complexity is low. In addition, our results indicate that the overall performance of resubstitution, as measured by the RMS, can be substantially better than that of the 10-repeated 4-fold cross-validation estimator, and even comparable to that of the 0.632 bootstrap estimator, provided that classifier complexity is low and the expected error rates are moderate. In addition to the results discussed in the paper, we provide an extensive set of plots that can be accessed on a companion website, at the URL http://ee.tamu.edu/~edward/exac_discrete. © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
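As an illustration of the two estimators the abstract compares, the following is a minimal Python sketch that computes resubstitution and leave-one-out error estimates for a discrete histogram rule on a single sample. It is not the paper's exact formulas for bias, variance, or RMS. The function names (`bin_counts`, `resubstitution`, `leave_one_out`), the toy data, and the tie-breaking convention (ties go to class 0) are assumptions made for this sketch.

```python
def bin_counts(xs, ys, n_bins):
    """Count class-0 (u) and class-1 (v) samples falling in each bin."""
    u = [0] * n_bins
    v = [0] * n_bins
    for x, y in zip(xs, ys):
        if y == 0:
            u[x] += 1
        else:
            v[x] += 1
    return u, v


def resubstitution(u, v):
    """Fraction of training points misclassified by majority vote per bin.

    In each bin the minority class is misclassified, so the error count
    is min(u_i, v_i) regardless of how ties are broken.
    """
    n = sum(u) + sum(v)
    return sum(min(ui, vi) for ui, vi in zip(u, v)) / n


def leave_one_out(u, v):
    """Leave-one-out error, assuming ties are assigned to class 0.

    The rule predicts class 1 only when v_i > u_i (assumed convention).
    Removing a class-0 point leaves counts (u_i - 1, v_i), which is
    misclassified when v_i > u_i - 1, i.e. v_i >= u_i.
    Removing a class-1 point leaves counts (u_i, v_i - 1), which is
    misclassified when v_i - 1 <= u_i.
    """
    n = sum(u) + sum(v)
    errors = 0
    for ui, vi in zip(u, v):
        if vi >= ui:
            errors += ui  # every held-out class-0 point in this bin errs
        if vi - 1 <= ui:
            errors += vi  # every held-out class-1 point in this bin errs
    return errors / n


# Toy sample (hypothetical): bin index per point and its class label.
xs = [0, 0, 0, 1, 1, 2, 2, 2]
ys = [0, 0, 1, 0, 1, 1, 1, 0]
u, v = bin_counts(xs, ys, n_bins=3)
print(resubstitution(u, v))  # 0.375
print(leave_one_out(u, v))   # 0.75
```

On this toy sample the resubstitution estimate is lower than the leave-one-out estimate, which is in line with the abstract's description of resubstitution as low-biased. A single sample says nothing about the variance or RMS of either estimator.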
Pages: 1799-1814
Page count: 16