Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models

Cited by: 472
Authors
Ekeberg, Magnus [1]
Lovkvist, Cecilia [2]
Lan, Yueheng [3]
Weigt, Martin [4]
Aurell, Erik [2, 5, 6]
Affiliations
[1] KTH Royal Inst Technol, Engn Phys Program, S-10044 Stockholm, Sweden
[2] AlbaNova Univ Ctr, Dept Computat Biol, S-10691 Stockholm, Sweden
[3] Tsinghua Univ, Dept Phys, Beijing 100084, Peoples R China
[4] Univ Paris 06, UMR7238, Lab Genom Microorganismes, F-75006 Paris, France
[5] KTH Royal Inst Technol, ACCESS Linnaeus Ctr, S-10044 Stockholm, Sweden
[6] Aalto Univ, Dept Informat & Comp Sci, FI-00076 Aalto, Finland
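Funding
Academy of Finland
Keywords
INFORMATION
DOI
10.1103/PhysRevE.87.012707
Chinese Library Classification (CLC) number
O35 [Fluid mechanics]; O53 [Plasma physics]
Discipline classification code
070204 [Plasma physics]; 070301 [Inorganic chemistry]
Abstract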
Spatially proximate amino acids in a protein tend to coevolve. A protein's three-dimensional (3D) structure hence leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from such correlations is an open problem in structural biology, pursued with increasing vigor as more and more protein sequences continue to fill the data banks. Within this task lies a statistical inference problem, rooted in the following: correlation between two sites in a protein sequence can arise from firsthand interaction but can also be network-propagated via intermediate sites; observed correlation is not enough to guarantee proximity. To separate direct from indirect interactions is an instance of the general problem of inverse statistical mechanics, where the task is to learn model parameters (fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In the context of protein sequences, the approach has been referred to as direct-coupling analysis. Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. This improved performance also relies on a modified score for the coupling strength. The results are verified using known crystal structures of specific sequence instances of various protein families. Code implementing the new method can be found at http://plmdca.csc.kth.se/.
Pages: 16