Estimating dataset size requirements for classifying DNA microarray data

Cited by: 192
Authors
Mukherjee, S
Tamayo, P
Rogers, S
Rifkin, R
Engle, A
Campbell, C
Golub, TR
Mesirov, JP
Affiliations
[1] MIT, Whitehead Inst, Ctr Genome Res, Cambridge, MA 02139 USA
[2] Dana Farber Canc Inst, Dept Pediat Oncol, Boston, MA 02115 USA
[3] MIT, McGovern Inst, Cambridge, MA 02139 USA
[4] MIT, CBCL, Cambridge, MA 02139 USA
[5] Univ Bristol, Dept Engn Math, Bristol BS8 1TH, Avon, England
Keywords
gene expression profiling; molecular pattern recognition; DNA microarrays; microarray analysis; sample size estimation
DOI
10.1089/106652703321825928
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced. The goal is to use existing classification results to estimate dataset size requirements for future classification experiments and to evaluate the gain in accuracy and significance of classifiers built with additional data. The method is based on fitting inverse power-law models to construct empirical learning curves. It also includes a permutation test procedure to assess the statistical significance of classification performance for a given dataset size. This procedure is applied to several molecular classification problems representing a broad spectrum of levels of complexity.
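The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes the common three-parameter inverse power-law form e(n) = a * n^(-alpha) + b for the error rate at training-set size n, and a generic label-permutation test; the abstract does not spell out either detail, so the parameterization, the grid search, the toy nearest-centroid classifier, and all function names below are hypothetical choices for illustration, not the authors' implementation. The data are synthetic.

```python
import random


def predict_error(n, a, alpha, b):
    """Assumed learning-curve model: error at training-set size n."""
    return a * n ** (-alpha) + b


def fit_inverse_power_law(sizes, errors, alphas=None):
    """Fit e(n) = a * n**(-alpha) + b.

    Grid search over alpha; for each alpha the model is linear in (a, b),
    so those two are solved in closed form by ordinary least squares.
    """
    if alphas is None:
        alphas = [k / 100 for k in range(1, 301)]  # 0.01 .. 3.00
    m = len(sizes)
    my = sum(errors) / m
    best = None
    for alpha in alphas:
        x = [n ** (-alpha) for n in sizes]
        mx = sum(x) / m
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, errors))
        a = sxy / sxx
        b = my - a * mx
        sse = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b


def loo_centroid_error(features, labels):
    """Leave-one-out error of a 1-D nearest-centroid classifier (toy stand-in)."""
    mistakes = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        sums, counts = {}, {}
        for j, (xj, yj) in enumerate(zip(features, labels)):
            if j != i:
                sums[yj] = sums.get(yj, 0.0) + xj
                counts[yj] = counts.get(yj, 0) + 1
        pred = min(sums, key=lambda c: abs(x - sums[c] / counts[c]))
        mistakes += pred != y
    return mistakes / len(features)


def permutation_p_value(features, labels, error_fn, n_perm=200, seed=0):
    """Fraction of label shufflings whose error is at least as low as observed."""
    rng = random.Random(seed)
    observed = error_fn(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if error_fn(features, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Synthetic learning curve generated from known parameters (noise-free).
sizes = [10, 15, 20, 25, 30, 35]
errors = [predict_error(n, 0.8, 0.5, 0.05) for n in sizes]
a, alpha, b = fit_inverse_power_law(sizes, errors)
print("fitted a, alpha, b:", a, alpha, b)
print("extrapolated error at n=200:", predict_error(200, a, alpha, b))

# Synthetic two-class data with one informative feature.
features = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
labels = [0] * 5 + [1] * 5
print("permutation p-value:", permutation_p_value(features, labels, loo_centroid_error))
```

In this sketch the fitted curve is extrapolated to a larger n to estimate the error expected from a bigger dataset, and the permutation p-value indicates whether the observed error at the current size is better than chance. With real data, the per-size error rates would come from repeated subsampling and the classifier would replace the toy `loo_centroid_error`.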
Pages: 119-142
Page count: 24