Sequence context-specific profiles for homology searching

Cited by: 120
Authors
Biegert, A. [1, 2]
Soeding, J. [1, 2]
Affiliations
[1] Univ Munich, Gene Ctr Munich, D-81377 Munich, Germany
[2] Univ Munich, Ctr Integrated Prot Sci, D-81377 Munich, Germany
Keywords
alignment; pseudocounts; substitution matrix; context-sensitive; ACID SUBSTITUTION MATRICES; LOCAL-STRUCTURE; TWILIGHT-ZONE; ALIGNMENT; PROTEINS; CLASSIFICATION; RECOGNITION; SENSITIVITY; TABLES; TOOL
DOI
10.1073/pnas.0810767106
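Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract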
Sequence alignment and database searching are essential tools in biology because a protein's function can often be inferred from homologous proteins. Standard sequence comparison methods use substitution matrices to find the alignment with the best sum of similarity scores between aligned residues. These similarity scores do not take the local sequence context into account. Here, we present an approach that derives context-specific amino acid similarities from short windows centered on each query sequence residue. Our results demonstrate that the sequence context contains much more information about the expected mutations than just the residue itself. By employing our context-specific similarities (CS-BLAST) in combination with NCBI BLAST, we increase the sensitivity more than 2-fold on a difficult benchmark set, without loss of speed. Alignment quality is likewise improved significantly. Furthermore, we demonstrate considerable improvements when applying this paradigm to sequence profiles: Two iterations of CSI-BLAST, our context-specific version of PSI-BLAST, are more sensitive than 5 iterations of PSI-BLAST. The paradigm for biological sequence comparison presented here is very general. It can replace substitution matrices in sequence- and profile-based alignment and search methods for both protein and nucleotide sequences.
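The idea of deriving residue similarities from a short window around each query position can be illustrated with a toy sketch. This is not the authors' implementation: the library of context profiles below is filled with random placeholder values, and the names `make_toy_library` and `context_specific_distribution`, the window length of 5, and the library size of 8 are all assumptions made for illustration. The sketch weights each hypothetical context profile by how well it matches the window, then mixes the profiles' central columns into one amino acid distribution per query position.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
IDX = {a: i for i, a in enumerate(ALPHABET)}


def random_distribution(rng, n=20):
    """Return n positive numbers summing to 1 (placeholder for trained values)."""
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


def make_toy_library(num_profiles=8, window=5, seed=0):
    """Build a hypothetical library of context profiles with uniform priors."""
    rng = random.Random(seed)
    return [
        {
            "prior": 1.0 / num_profiles,
            "columns": [random_distribution(rng) for _ in range(window)],
        }
        for _ in range(num_profiles)
    ]


def context_specific_distribution(seq, pos, library):
    """Mix the central columns of all profiles, weighted by how well each
    profile matches the window centred on seq[pos]."""
    window = len(library[0]["columns"])
    half = window // 2
    log_weights = []
    for profile in library:
        logw = math.log(profile["prior"])
        for j in range(window):
            k = pos - half + j
            if 0 <= k < len(seq):  # window positions beyond the sequence ends are skipped
                logw += math.log(profile["columns"][j][IDX[seq[k]]])
        log_weights.append(logw)
    # Normalise in log space for numerical stability.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    # Weighted mixture of the central-column distributions.
    mixed = [0.0] * len(ALPHABET)
    for w, profile in zip(weights, library):
        centre = profile["columns"][half]
        for a in range(len(ALPHABET)):
            mixed[a] += w * centre[a]
    return mixed


library = make_toy_library()
query = "MKTAYIAKQR"  # arbitrary example sequence
profile = [context_specific_distribution(query, i, library) for i in range(len(query))]
```

The result is one probability distribution per query residue, that is, a sequence profile built from a single sequence. In a real setting the library would be learned from data rather than drawn at random, and the resulting distributions would typically be combined with the observed residue before being handed to a search tool.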
Pages: 3770-3775
Number of pages: 6