A hierarchical Bayesian model for predicting the functional consequences of amino-acid polymorphisms

Cited by: 15
Authors
Verzilli, CJ
Whittaker, JC
Stallard, N
Chasman, D
Affiliations
[1] Univ London Imperial Coll Sci & Technol, Dept Epidemiol & Publ Hlth, London W2 1PG, England
[2] Univ Reading, Reading RG6 2AH, Berks, England
[3] Variagenics, Cambridge, MA USA
Keywords
Bayesian inference; hierarchical model; multivariate adaptive regression splines; protein site-directed mutagenesis; supervised learning;
DOI
10.1111/j.1467-9876.2005.00478.x
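Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification code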
020208 ; 070103 ; 0714 ;
Abstract
Genetic polymorphisms in deoxyribonucleic acid coding regions may have a phenotypic effect on the carrier, e.g. by influencing susceptibility to disease. Detection of deleterious mutations via association studies is hampered by the large number of candidate sites; therefore methods are needed to narrow down the search to the most promising sites. For this, a possible approach is to use structural and sequence-based information of the encoded protein to predict whether a mutation at a particular site is likely to disrupt the functionality of the protein itself. We propose a hierarchical Bayesian multivariate adaptive regression spline (BMARS) model for supervised learning in this context and assess its predictive performance by using data from mutagenesis experiments on lac repressor and lysozyme proteins. In these experiments, about 12 amino-acid substitutions were performed at each native amino-acid position and the effect on protein functionality was assessed. The training data thus consist of repeated observations at each position, which the hierarchical framework is needed to account for. The model is trained on the lac repressor data and tested on the lysozyme mutations and vice versa. In particular, we show that the hierarchical BMARS model, by allowing for the clustered nature of the data, yields lower out-of-sample misclassification rates than a BMARS model, a frequentist MARS model, a support vector machine classifier and an optimally pruned classification tree.
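As a rough illustration of the kind of predictor the abstract describes, the sketch below combines MARS-style hinge basis functions with a per-position random intercept, so that the repeated substitutions at one amino-acid position share a common effect. It is a minimal sketch under stated assumptions, not the authors' implementation: the logistic link, the function names (`hinge`, `mars_basis`, `predict_prob`), the feature values and all coefficients are hypothetical, and in the Bayesian setting the basis functions and coefficients would be sampled rather than fixed.

```python
import numpy as np

def hinge(x, knot, sign):
    """Truncated linear (hinge) function used by MARS: max(0, sign * (x - knot))."""
    return np.maximum(0.0, sign * (x - knot))

def mars_basis(X, terms):
    """One MARS basis function: a product of hinge functions.

    `terms` is a list of (column, knot, sign) triples.
    """
    out = np.ones(X.shape[0])
    for col, knot, sign in terms:
        out *= hinge(X[:, col], knot, sign)
    return out

def predict_prob(X, position, bases, beta, intercept, position_effect):
    """Probability that a substitution affects function.

    The linear predictor adds a random intercept for each native position
    (the hierarchical part). The logistic link is an assumption of this sketch.
    """
    B = np.column_stack([mars_basis(X, t) for t in bases])
    eta = intercept + B @ beta + position_effect[position]
    return 1.0 / (1.0 + np.exp(-eta))

# Made-up toy inputs: 4 substitutions at 2 positions, 2 features each.
X = np.array([[0.2, 1.0], [0.8, 1.0], [0.5, 3.0], [0.9, 3.0]])
position = np.array([0, 0, 1, 1])           # position index of each substitution
bases = [[(0, 0.5, +1)],                    # main effect of feature 0
         [(0, 0.5, +1), (1, 2.0, +1)]]      # interaction of features 0 and 1
beta = np.array([2.0, -1.0])                # hypothetical coefficients
position_effect = np.array([0.5, -0.5])     # hypothetical position-level effects

p = predict_prob(X, position, bases, beta, 0.0, position_effect)
print(np.round(p, 3))
```

Substitutions at the same position share one entry of `position_effect`, which is how a hierarchical model can account for the clustering of observations within positions.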
Pages: 191-206
Page count: 16