How large a training set is needed to develop a classifier for microarray data?

Cited by: 89
Authors
Dobbin, Kevin K. [1 ]
Zhao, Yingdong [1 ]
Simon, Richard M. [1 ]
Affiliations
[1] NCI, Biometr Res Branch, Div Canc Treatment & Diag, NIH, Rockville, MD 20852 USA
Keywords
DOI
10.1158/1078-0432.CCR-07-0443
Chinese Library Classification (CLC) number
R73 [Oncology];
Discipline classification code
100214 ;
Abstract
Purpose: A common goal of gene expression microarray studies is the development of a classifier that can be used to divide patients into groups with different prognoses, or with different expected responses to a therapy. These types of classifiers are developed on a training set, which is the set of samples used to train a classifier. The question of how many samples are needed in the training set to produce a good classifier from high-dimensional microarray data is challenging.
Experimental Design: We present a model-based approach to determining the sample size required to adequately train a classifier.
Results: It is shown that sample size can be determined from three quantities: standardized fold change, class prevalence, and number of genes or features on the arrays. Numerous examples and important experimental design issues are discussed. The method is adapted to address ex post facto determination of whether the size of a training set used to develop a classifier was adequate. An interactive web site for performing the sample size calculations is provided.
Conclusion: We showed that sample size calculations for classifier development from high-dimensional microarray data are feasible, discussed numerous important considerations, and presented examples.
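To illustrate how the three quantities named in the Results can interact, the following is a minimal sketch and not the authors' model-based method, which this record does not reproduce. It uses a textbook normal-approximation power calculation for a much simpler question: how large a training set is needed before a single informative gene is likely to pass a per-gene significance filter. The function names, the two-sided z-test filter, the default of one expected false positive across all genes, and the reading of "standardized fold change" as the difference in class means divided by the within-class standard deviation are all assumptions made for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def selection_power(n, delta, prevalence, n_genes, expected_false_positives=1.0):
    """Approximate chance that one informative gene passes a two-sided z-test.

    n          -- total training-set size (hypothetical planning input)
    delta      -- assumed standardized fold change (mean difference / within-class SD)
    prevalence -- proportion of samples in one of the two classes
    n_genes    -- number of genes (features) on the array
    """
    # Per-gene significance level chosen so that roughly
    # `expected_false_positives` null genes pass the filter.
    alpha = expected_false_positives / n_genes
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # With n*prevalence and n*(1 - prevalence) samples per class, the standard
    # error of the mean difference (in SD units) is 1 / sqrt(n * pi * (1 - pi)).
    shift = delta * (n * prevalence * (1 - prevalence)) ** 0.5
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def smallest_n(delta, prevalence, n_genes, target_power=0.9, max_n=10000):
    """Smallest total sample size reaching the target selection power, or None."""
    for n in range(4, max_n + 1):
        if selection_power(n, delta, prevalence, n_genes) >= target_power:
            return n
    return None


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the paper.
    for genes in (1000, 22000):
        for prev in (0.5, 0.2):
            print(genes, prev, smallest_n(delta=1.0, prevalence=prev, n_genes=genes))
```

Under these assumptions the required size grows as the standardized fold change shrinks, as the classes become more unbalanced, and as the number of genes increases. The paper's own criterion concerns classifier performance rather than per-gene selection power, so the numbers this sketch prints should not be read as the paper's recommendations.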
Pages: 108-114
Number of pages: 7
Related papers
17 in total
[1]   Gene-expression profiles predict survival of patients with lung adenocarcinoma [J].
Beer, DG ;
Kardia, SLR ;
Huang, CC ;
Giordano, TJ ;
Levin, AM ;
Misek, DE ;
Lin, L ;
Chen, GA ;
Gharib, TG ;
Thomas, DG ;
Lizyness, ML ;
Kuick, R ;
Hayasaka, S ;
Taylor, JMG ;
Iannettoni, MD ;
Orringer, MB ;
Hanash, S .
NATURE MEDICINE, 2002, 8 (08) :816-824
[2]   CONTROLLING THE FALSE DISCOVERY RATE - A PRACTICAL AND POWERFUL APPROACH TO MULTIPLE TESTING [J].
BENJAMINI, Y ;
HOCHBERG, Y .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1995, 57 (01) :289-300
[3]  
Carlin B. P., 2001, BAYES EMPIRICAL BAYE
[4]   Sample size determination in microarray experiments for class comparison and prognostic classification [J].
Dobbin, K ;
Simon, R .
BIOSTATISTICS, 2005, 6 (01) :27-38
[5]   Sample size planning for developing classifiers using high-dimensional DNA microarray data [J].
Dobbin, Kevin K. ;
Simon, Richard M. .
BIOSTATISTICS, 2007, 8 (01) :101-117
[6]   Outcome signature genes in breast cancer: is there a unique set? [J].
Ein-Dor, L ;
Kela, I ;
Getz, G ;
Givol, D ;
Domany, E .
BIOINFORMATICS, 2005, 21 (02) :171-178
[7]   Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer [J].
Ein-Dor, L ;
Zuk, O ;
Domany, E .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2006, 103 (15) :5923-5928
[8]   Concordance among gene-expression-based predictors for breast cancer [J].
Fan, Cheng ;
Oh, Daniel S. ;
Wessels, Lodewyk ;
Weigelt, Britta ;
Nuyten, Dimitry S. A. ;
Nobel, Andrew B. ;
van't Veer, Laura J. ;
Perou, Charles M. .
NEW ENGLAND JOURNAL OF MEDICINE, 2006, 355 (06) :560-569
[9]   Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring [J].
Golub, TR ;
Slonim, DK ;
Tamayo, P ;
Huard, C ;
Gaasenbeek, M ;
Mesirov, JP ;
Coller, H ;
Loh, ML ;
Downing, JR ;
Caligiuri, MA ;
Bloomfield, CD ;
Lander, ES .
SCIENCE, 1999, 286 (5439) :531-537
[10]   Controlling the number of false discoveries: application to high-dimensional genomic data [J].
Korn, EL ;
Troendle, JF ;
McShane, LM ;
Simon, R .
JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 2004, 124 (02) :379-398