Corrected small-sample estimation of the Bayes error

Cited by: 4
Authors
Brun, M
Sabbagh, D
Kim, S
Dougherty, ER [1]
Affiliations
[1] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77840 USA
[2] NHGRI, Canc Genet Branch, Bethesda, MD 20892 USA
[3] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
Keywords
DOI
10.1093/bioinformatics/btg144
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: A major problem in pattern classification is estimating the Bayes error when only small samples are available. One way to estimate the Bayes error is to design a classifier by applying some classification rule to sample data, estimate the error of the designed classifier, and then use this estimate as an estimate of the Bayes error. Relative to the Bayes error, the expected error of the designed classifier is biased high, and this bias can be severe with small samples. Results: This paper provides a correction for the bias by subtracting a term derived from the representation of the estimation error. It does so for Boolean classifiers, which are defined on binary features. Although the general theory applies to any Boolean classifier, a model is introduced to reduce the number of parameters. A key point is that the expected correction is conservative. Properties of the corrected estimate are studied via simulation. The correction also applies to binary predictors because they are mathematically identical to Boolean classifiers. In this context the correction is adapted to the coefficient of determination, which has been used to measure nonlinear multivariate relations between genes and to design genetic regulatory networks. An application using gene-expression data from a microarray experiment is provided on the website http://gspsnap.tamu.edu/smallsample/ (user: 'smallsample', password: 'smallsample').
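The upward bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's correction or its model: it assumes a hypothetical distribution over three binary features (eight states, `P_X` and `P_Y1` are made-up values) and uses a plain majority-vote (histogram) rule as the Boolean classifier. Because the distribution is known here, the true error of each designed classifier can be computed exactly and compared with the Bayes error.

```python
import random

def bayes_error(p_x, p_y1):
    """Bayes error: at each state the optimal classifier errs with min(p, 1 - p)."""
    return sum(px * min(p1, 1.0 - p1) for px, p1 in zip(p_x, p_y1))

def design_histogram_rule(sample, n_states):
    """Majority vote per state; ties and unobserved states default to label 0."""
    ones = [0] * n_states
    totals = [0] * n_states
    for x, y in sample:
        totals[x] += 1
        ones[x] += y
    return [1 if 2 * ones[x] > totals[x] else 0 for x in range(n_states)]

def true_error(classifier, p_x, p_y1):
    """Exact error of a fixed Boolean classifier under the known distribution."""
    return sum(
        px * ((1.0 - p1) if label == 1 else p1)
        for label, px, p1 in zip(classifier, p_x, p_y1)
    )

def draw_sample(n, p_x, p_y1, rng):
    """Draw n (state, label) pairs from the assumed distribution."""
    states = rng.choices(range(len(p_x)), weights=p_x, k=n)
    return [(x, 1 if rng.random() < p_y1[x] else 0) for x in states]

def mean_designed_error(n, p_x, p_y1, trials, rng):
    """Average true error of classifiers designed from samples of size n."""
    total = 0.0
    for _ in range(trials):
        clf = design_histogram_rule(draw_sample(n, p_x, p_y1, rng), len(p_x))
        total += true_error(clf, p_x, p_y1)
    return total / trials

# Hypothetical model (an assumption for illustration, not taken from the paper):
# 3 binary features give 8 equally likely states, each with P(Y = 1 | state).
P_X = [0.125] * 8
P_Y1 = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.6]

if __name__ == "__main__":
    rng = random.Random(0)
    print("Bayes error:", bayes_error(P_X, P_Y1))
    for n in (10, 20, 40, 80):
        print("n =", n, "mean designed error:",
              mean_designed_error(n, P_X, P_Y1, 2000, rng))
```

Every designed classifier has a true error at least as large as the Bayes error, so the averaged error sits above it, and the gap would be expected to shrink as the sample size grows. The paper's correction subtracts a term intended to offset this gap; that term is not reproduced here.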
Pages: 944-951
Number of pages: 8