Learning generative models for protein fold families

Cited by: 224
Authors
Balakrishnan, Sivaraman [1]
Kamisetty, Hetunandan [2]
Carbonell, Jaime G. [1, 2, 3]
Lee, Su-In [4, 5]
Langmead, Christopher James [2, 3]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
[3] Carnegie Mellon Univ, Lane Ctr Computat Biol, Pittsburgh, PA 15213 USA
[4] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
[5] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
Funding
National Science Foundation (USA)
Keywords
protein sequence; probabilistic graphical models; Markov random fields; regularization; generative model; HIDDEN MARKOV-MODELS; CORRELATED MUTATIONS; GRAPHICAL MODELS; RESIDUES; NETWORKS;
DOI
10.1002/prot.22934
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010 ; 081704 ;
Abstract
We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy.
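The two kinds of statistics named in the abstract, position-specific conservation and correlated mutations between pairs of positions, can be illustrated on a toy alignment. The sketch below is not GREMLIN and does not learn a graphical model; it only computes per-column entropy and pairwise mutual information. The alignment `msa` and the function names are hypothetical, chosen for illustration.

```python
from collections import Counter
from math import log2

# Toy alignment (hypothetical data): each string is one sequence,
# each index is one aligned position (column).
msa = ["ACDA", "ACEA", "GCDG", "GCEG"]


def column(msa, i):
    """Return the residues found at aligned position i."""
    return [seq[i] for seq in msa]


def entropy(msa, i):
    """Shannon entropy in bits of column i; 0 means fully conserved."""
    n = len(msa)
    counts = Counter(column(msa, i))
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def mutual_information(msa, i, j):
    """Mutual information in bits between columns i and j.

    High values mean the residues at the two positions co-vary.
    """
    n = len(msa)
    p_i = Counter(column(msa, i))
    p_j = Counter(column(msa, j))
    p_ij = Counter(zip(column(msa, i), column(msa, j)))
    return sum(
        (c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


conservation = [entropy(msa, i) for i in range(len(msa[0]))]
coupled = mutual_information(msa, 0, 3)      # columns 0 and 3 always change together
independent = mutual_information(msa, 0, 2)  # columns 0 and 2 vary independently
```

In this toy data, column 1 is fully conserved, while columns 0 and 3 mutate together. Raw pairwise scores like these cannot separate direct couplings from indirect ones, which is one motivation for fitting a joint model over all positions as the abstract describes.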
Pages: 1061-1078
Page count: 18