How many samples are needed to build a classifier: a general sequential approach

Cited by: 20
Authors
Fu, WJJ
Dougherty, ER
Mallick, B
Carroll, RJ
Affiliations
[1] Texas A&M Univ, Dept Stat, College Stn, TX 77843 USA
[2] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77840 USA
[3] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
DOI
10.1093/bioinformatics/bth461
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: The standard paradigm for classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.
Results: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
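The sequential idea in the abstract can be illustrated with a minimal sketch. This is not the authors' martingale-based stopping criterion, which the abstract does not spell out. As a stand-in assumption, it uses a one-sided normal-approximation upper confidence bound on the test-then-train error: each new subject is first classified by the current rule, then added to the training data. The nearest-centroid rule, the simulated two-class Gaussian stream, and all names and defaults (`sequential_design`, `burn_in`, `min_tested`, `threshold`) are hypothetical choices made for illustration only.

```python
import math
import random


def predict(sums, counts, x):
    """Nearest-centroid rule: return the label whose running mean is closest to x."""
    best_label, best_dist = None, float("inf")
    for label, total in sums.items():
        mean = [s / counts[label] for s in total]
        dist = sum((a - b) ** 2 for a, b in zip(x, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def sequential_design(stream, threshold=0.2, z=1.645, burn_in=10,
                      min_tested=30, max_n=2000):
    """Enroll subjects one at a time; stop once an upper confidence bound
    on the running misclassification rate falls below `threshold`.

    Returns (subjects used, last upper bound, whether the rule stopped early).
    The bound is an illustrative assumption, not the paper's criterion.
    """
    sums, counts = {}, {}
    errors = tested = n = 0
    upper = 1.0
    for x, y in stream:
        n += 1
        if n > burn_in:
            # Test first: the label y has not yet been used for training.
            errors += predict(sums, counts, x) != y
            tested += 1
        # Then train: update the running class sums with the new subject.
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
        if tested > 0:
            p_hat = errors / tested
            upper = p_hat + z * math.sqrt(p_hat * (1.0 - p_hat) / tested)
            # min_tested guards against stopping on a degenerate early bound.
            if tested >= min_tested and upper < threshold:
                return n, upper, True
        if n >= max_n:
            break
    return n, upper, False


def two_class_stream(rng, shift=2.0, dim=2):
    """Simulated subjects: class 0 ~ N(0, I), class 1 ~ N(shift, I)."""
    while True:
        y = rng.randint(0, 1)
        yield [rng.gauss(shift * y, 1.0) for _ in range(dim)], y


n_used, upper, stopped = sequential_design(two_class_stream(random.Random(0)))
print(f"subjects used: {n_used}, upper bound: {upper:.3f}, stopped early: {stopped}")
```

Because the running error averages over rules trained on fewer subjects, it tends to overstate the error of the final rule, so this stand-in bound is likely conservative. Any classification rule could replace `predict` as long as each subject is classified before its label is used for training.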
Pages: 63-70
Page count: 8