Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning

Cited by: 11
Authors
Adhikari, Badri [1]
Hou, Jie [2]
Cheng, Jianlin [2]
Affiliations
[1] Univ Missouri, Dept Math & Comp Sci, Columbia, MO USA
[2] Univ Missouri, Dept Elect Engn & Comp Sci, Columbia, MO 65211 USA
Keywords
CASP; coevolution; deep learning; machine learning; multiple sequence alignment; protein contact prediction; RESIDUE-RESIDUE CONTACTS; RECONSTRUCTION; NETWORKS; MAPS;
DOI
10.1002/prot.25405
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
In this study, we evaluate the residue-residue contacts predicted by our three methods in the CASP12 experiment, focusing on the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignments, from which coevolution-based features are derived and integrated by a neural network to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and the machine learning integration of coevolution-based and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, coevolution-based features alone improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises it to 56.3%, when the top L/5 predicted long-range contacts are evaluated. The correlation between the precision of contact prediction and the logarithm of the number of effective sequences in the alignments is 0.66.
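The "top L/5 long-range precision" metric quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes the commonly used convention that a long-range pair has a sequence separation of at least 24 residues, and that L is the sequence length; the function name `top_l5_long_range_precision` and its parameters are hypothetical.

```python
from typing import Iterable, List, Tuple

# (residue i, residue j, predicted contact probability); illustrative type alias
Contact = Tuple[int, int, float]


def top_l5_long_range_precision(
    predicted: Iterable[Contact],
    true_contacts: Iterable[Tuple[int, int]],
    seq_len: int,
    min_sep: int = 24,  # assumed long-range threshold, not stated in the abstract
) -> float:
    """Fraction of the top L/5 long-range predictions that are true contacts."""
    # Keep only pairs whose sequence separation qualifies as long-range.
    long_range: List[Contact] = [
        (i, j, p) for i, j, p in predicted if abs(i - j) >= min_sep
    ]
    # Rank by predicted probability, highest first.
    long_range.sort(key=lambda c: c[2], reverse=True)

    # Select the top L/5 predictions (at least one).
    n_top = max(1, seq_len // 5)
    selected = long_range[:n_top]
    if not selected:
        return 0.0

    # Store true contacts in a canonical (smaller, larger) order.
    true_set = {(min(i, j), max(i, j)) for i, j in true_contacts}
    hits = sum(1 for i, j, _ in selected if (min(i, j), max(i, j)) in true_set)

    # Here precision is computed over the predictions actually selected;
    # some evaluations divide by L/5 even when fewer predictions exist.
    return hits / len(selected)
```

For a toy 10-residue sequence with `min_sep=3`, the call `top_l5_long_range_precision([(1, 8, 0.9), (2, 9, 0.8), (1, 2, 0.99), (3, 10, 0.1)], [(8, 1)], 10, min_sep=3)` selects the two highest-ranked long-range pairs, of which one is a true contact, and returns 0.5.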
Pages: 84-96
Number of pages: 13