On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers

Cited by: 30
Authors
Zollanvari, Amin [1]
Braga-Neto, Ulisses M. [1]
Dougherty, Edward R. [1, 2, 3]
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
[2] Translat Genom Res Inst, Computat Biol Div, Phoenix, AZ USA
[3] Univ Texas MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
Funding
National Science Foundation (USA);
Keywords
Error estimation; Parametric classification; Linear discriminant analysis; Sampling distribution; Resubstitution; Leave-one-out; CLINICAL BEHAVIOR; QUADRATIC-FORMS; OVARIAN-CANCER; EXPRESSION; CLASSIFICATION; MICROARRAY; PREDICTION;
DOI
10.1016/j.patcog.2009.05.003
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Error estimation is a problem of high current interest in many areas of application. This paper concerns the classical problem of determining the performance of error estimators in small-sample settings under a Gaussianity parametric assumption. We provide here for the first time the exact sampling distribution of the resubstitution and leave-one-out error estimators for linear discriminant analysis (LDA) in the univariate case, which is valid for any sample size and combination of parameters (including unequal variances and sample sizes for each class). In the multivariate case, we provide a quasi-binomial approximation to the distribution of both the resubstitution and leave-one-out error estimators for LDA, under a common but otherwise arbitrary class covariance matrix, which is assumed to be known in the design of the LDA discriminant. We provide numerical examples, using both synthetic and real data, that indicate that these approximations are accurate, provided that LDA classification error is not too large. (C) 2009 Elsevier Ltd. All rights reserved.
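For readers unfamiliar with the two estimators named in the abstract, the following is a minimal sketch of how they can be computed for a univariate, two-class LDA rule. It is not the paper's derivation of the sampling distributions. It assumes equal class priors, so the rule reduces to a threshold at the midpoint of the two sample means, and the function names (lda_predict, resubstitution_error, leave_one_out_error) are illustrative choices, not taken from the paper.

```python
from statistics import mean


def lda_predict(x, x0, x1):
    """Classify x with univariate LDA designed on samples x0 (class 0) and x1 (class 1).

    Assumption: equal priors. The pooled variance only rescales the
    discriminant by a positive factor, so it does not affect the sign.
    """
    m0, m1 = mean(x0), mean(x1)
    w = (x - (m0 + m1) / 2.0) * (m0 - m1)
    return 0 if w >= 0 else 1


def resubstitution_error(x0, x1):
    """Fraction of training points misclassified by the rule designed on all of them."""
    errors = sum(lda_predict(x, x0, x1) != 0 for x in x0)
    errors += sum(lda_predict(x, x0, x1) != 1 for x in x1)
    return errors / (len(x0) + len(x1))


def leave_one_out_error(x0, x1):
    """Fraction of points misclassified when each is held out of the design sample.

    Requires at least two points per class (lists), so that a class mean
    still exists after one point is removed.
    """
    errors = 0
    for i, x in enumerate(x0):
        errors += lda_predict(x, x0[:i] + x0[i + 1:], x1) != 0
    for i, x in enumerate(x1):
        errors += lda_predict(x, x0, x1[:i] + x1[i + 1:]) != 1
    return errors / (len(x0) + len(x1))


if __name__ == "__main__":
    # Small hand-made samples, for illustration only.
    class0 = [0.0, 1.0, 2.0, 3.5]
    class1 = [2.5, 4.0, 5.0, 6.0]
    print(resubstitution_error(class0, class1))
    print(leave_one_out_error(class0, class1))
```

Both estimators are counts of misclassified points divided by the total sample size, so for a fixed sample size they take values on a finite grid.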
Pages: 2705-2723
Number of pages: 19