How many samples are needed to build a classifier: a general sequential approach

Cited by: 20

Authors:

Fu, WJJ

Dougherty, ER

Mallick, B

Carroll, RJ

Affiliations:

[1] Texas A&M Univ, Dept Stat, College Stn, TX 77843 USA

[2] Texas A&M Univ, Dept Elect Engn, College Stn, TX 77840 USA

[3] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA

Source:

BIOINFORMATICS | 2005 / Vol. 21 / Issue 01

Keywords:

DOI:

10.1093/bioinformatics/bth461

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.

Results: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
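The general shape of such a sequential design can be illustrated with a minimal Python sketch. This is not the martingale-based stopping rule studied in the paper, whose details the abstract does not give. As a stand-in, it scores each new subject with the rule trained on all earlier subjects (test first, then train) and stops once a one-sided Wilson score upper bound on the accumulated error rate falls below the threshold. Treating those test outcomes as independent trials is only an approximation. All names (`nearest_mean_rule`, `wilson_upper`, `sequential_design`, `make_gaussian_draw`), the toy nearest-mean rule, the default parameters and the synthetic data are assumptions made for illustration.

```python
import math
import random


def nearest_mean_rule(samples):
    """Toy 1-D nearest-class-mean rule; returns a predict(x) function."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in samples if y == label]
        if not xs:
            return lambda x: 0  # a class is still unseen: constant guess
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


def wilson_upper(errors, trials, z):
    """One-sided Wilson score upper bound for an error proportion."""
    p = errors / trials
    centre = p + z * z / (2 * trials)
    spread = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return (centre + spread) / (1 + z * z / trials)


def sequential_design(draw, threshold=0.2, z=1.645, burn_in=10,
                      min_tests=20, max_n=2000):
    """Recruit subjects one at a time; return (n, error_estimate, upper_bound)
    at the first step where the bound falls below threshold, else None."""
    samples, n_errors, n_tests = [], 0, 0
    for n in range(1, max_n + 1):
        x, y = draw()
        if n > burn_in:
            predict = nearest_mean_rule(samples)  # trained on earlier subjects only
            n_errors += int(predict(x) != y)      # test first ...
            n_tests += 1
        samples.append((x, y))                    # ... then train
        if n_tests >= min_tests:
            upper = wilson_upper(n_errors, n_tests, z)
            if upper < threshold:
                return n, n_errors / n_tests, upper
    return None


def make_gaussian_draw(seed, gap=3.0):
    """Synthetic subjects (an assumption): feature ~ Normal(gap * label, 1)."""
    rng = random.Random(seed)

    def draw():
        y = rng.randrange(2)
        return rng.gauss(gap * y, 1.0), y

    return draw


# Example run on synthetic data; the stopping time depends on the seed.
print(sequential_design(make_gaussian_draw(seed=0)))
```

Because the classification rule enters only through `nearest_mean_rule`, any other rule could be swapped in without changing the stopping logic, which mirrors property (3) of the abstract. The early test outcomes come from poorly trained rules, so the accumulated error rate tends to overstate the error of the final classifier and the sketch errs on the side of recruiting more subjects.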


Pages: 63-70

Page count: 8

14 references in total

[1] Selection bias in gene extraction on the basis of microarray gene-expression data [J].