Optimal data collection for correlated mutation analysis

Cited by: 22
Authors
Ashkenazy, Haim [1, 2]
Unger, Ron [2 ]
Kliger, Yossef [1 ]
Affiliations
[1] Compugen Ltd, IL-69512 Tel Aviv, Israel
[2] Bar Ilan Univ, Mina & Everard Goodman Fac Life Sci, IL-52900 Ramat Gan, Israel
Keywords
ab-initio structure prediction; correlated mutations; protein structure prediction; residue covariation; contact prediction; multiple sequence alignment; residue contacts; mutual information; evolutionary information; coevolving positions; fold recognition; conservation; phylogeny
DOI
10.1002/prot.22168
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
The main objective of correlated mutation analysis (CMA) is to predict intra-protein residue-residue interactions from sequence alone. Despite considerable progress in algorithms and computing power, the performance of CMA methods remains quite low. Here we examine whether, and to what extent, the quality of CMA predictions depends on the sequences included in the multiple sequence alignment (MSA). The results revealed a strong correlation between the number of homologs in an MSA and CMA prediction strength. Furthermore, although many current methods include only orthologs in the MSA, we found that it is beneficial to include both orthologs and paralogs. Remarkably, even remote homologs contribute to the improved accuracy. Based on our findings, we put forward an automated data collection procedure that requires a minimum coverage of 50% between the query protein and its orthologs and paralogs. This procedure improves accuracy even in the absence of manual curation. In this era of massive sequencing and exploding sequence data, our results suggest that correlated mutation-based methods have not reached their inherent performance limits and that the role of CMA in structural biology is far from being fulfilled.
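The two ideas in the abstract, a coverage filter on collected homologs and a covariation score between alignment columns, can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not define how coverage is computed, so the code assumes it is the fraction of query residues aligned to a non-gap residue in the homolog, and it uses plain mutual information (listed among the keywords) as a stand-in for a CMA score. The function names `coverage`, `filter_by_coverage`, and `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2

GAP = "-"


def coverage(query_aligned: str, hit_aligned: str) -> float:
    """Fraction of query residues aligned to a non-gap residue in the hit.

    Assumption: this definition of coverage is illustrative; the abstract
    only states a 50% minimum, not how coverage is measured.
    """
    query_positions = [i for i, c in enumerate(query_aligned) if c != GAP]
    if not query_positions:
        return 0.0
    covered = sum(1 for i in query_positions if hit_aligned[i] != GAP)
    return covered / len(query_positions)


def filter_by_coverage(query_aligned, hits_aligned, min_coverage=0.5):
    """Keep only aligned homologs that cover at least min_coverage of the query."""
    return [h for h in hits_aligned if coverage(query_aligned, h) >= min_coverage]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between MSA columns i and j.

    Assumption: plain MI is used here as a simple covariation score,
    not as the specific CMA method evaluated in the paper.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


if __name__ == "__main__":
    query = "ACDEFG"
    hits = ["ACD---", "AC----", "ACDEFG"]
    kept = filter_by_coverage(query, hits)
    print(kept)
    print(mutual_information(["AA", "AA", "CC", "CC"], 0, 1))
```

With more homologs passing the filter, the column frequency estimates that feed a score like this rest on more observations, which is one plausible reading of why the abstract reports better predictions from larger alignments.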
Pages: 545-555
Page count: 11