Validation of computational methods in genomics

Times cited: 28
Authors
Dougherty, Edward R.
Hua, Jianping
Bittner, Michael L.
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
[2] Translat Genom Res Inst, Computat Biol Div, Phoenix, AZ USA
[3] Univ Texas, MD Anderson Canc Ctr, Dept Pathol, Houston, TX 77030 USA
Keywords
GENE-EXPRESSION PROFILES; ERROR ESTIMATORS; FEATURE-SELECTION; CROSS-VALIDATION; OPTIMAL NUMBER; FEATURES; CLASSIFICATION; PREDICTION; CHROMOSOME; SURVIVAL;
DOI
10.2174/138920207780076956
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
High-throughput technologies for genomics provide tens of thousands of genetic measurements, for instance, gene-expression measurements on microarrays, and the availability of these measurements has motivated the use of machine learning (inference) methods for classification, clustering, and gene networks. Generally, a design method will yield a model that satisfies some model constraints and fits the data in some manner. On the other hand, a scientific theory consists of two parts: (1) a mathematical model to characterize relations between variables, and (2) a set of relations between model variables and observables that are used to validate the model via predictive experiments. Although machine learning algorithms are constructed in the hope of producing valid scientific models, they do not ipso facto do so. In some cases, such as classifier estimation, there is a well-developed error theory that relates to model validity according to various statistical theorems, but in others, such as clustering, there is a lack of understanding of the relationship between the learning algorithms and validation. The issue of validation is especially problematic in situations where the sample size is small in comparison with the dimensionality (number of variables), which is commonplace in genomics, because the convergence theory of learning algorithms is typically asymptotic and the algorithms often perform in counter-intuitive ways when used with samples that are small in relation to the number of variables. For translational genomics, validation is perhaps the most critical issue, because it is imperative that we understand the performance of a diagnostic or therapeutic procedure to be used in the clinic, and this performance relates directly to the validity of the model behind the procedure. This paper treats the validation issue as it appears in two classes of inference algorithms relating to genomics: classification and clustering. It formulates the problem and reviews salient results.
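The small-sample concern raised in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the nearest-centroid classifier, the pure-noise data, the sample and feature counts, and all function names below are assumptions chosen for illustration. Because the labels carry no information about the features, the true error of any classifier here is 0.5, so the gap between each estimate and 0.5 shows how far an error estimator can stray when the number of variables greatly exceeds the number of samples.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train_x, train_y, x):
    """Nearest-centroid rule (illustrative choice of classifier)."""
    c0 = centroid([r for r, y in zip(train_x, train_y) if y == 0])
    c1 = centroid([r for r, y in zip(train_x, train_y) if y == 1])
    return 0 if sq_dist(x, c0) <= sq_dist(x, c1) else 1

def resubstitution_error(xs, ys):
    """Error measured on the same samples used to design the classifier."""
    wrong = sum(predict(xs, ys, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

def leave_one_out_error(xs, ys):
    """Leave-one-out cross-validation: each sample is held out once."""
    wrong = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        wrong += predict(train_x, train_y, xs[i]) != ys[i]
    return wrong / len(xs)

# Assumed setup: 20 samples, 1000 features, labels independent of the data.
rng = random.Random(0)
n_samples, n_features = 20, 1000
xs = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
ys = [0] * (n_samples // 2) + [1] * (n_samples // 2)

resub = resubstitution_error(xs, ys)
loo = leave_one_out_error(xs, ys)
print(f"resubstitution estimate: {resub:.2f}")
print(f"leave-one-out estimate:  {loo:.2f}")
print("true error (labels are noise): 0.50")
```

In this setup the resubstitution estimate tends to be strongly optimistic, while the leave-one-out estimate is not guaranteed to be accurate either, since it is computed from only 20 held-out decisions. The sketch shows one synthetic case and should not be read as a general ranking of estimators.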
Pages: 1 / 19
Page count: 19