Local homology recognition and distance measures in linear time using compressed amino acid alphabets

被引:88
作者
Edgar, RC
机构
[1] Mill Valley, CA 94941
关键词
D O I
10.1093/nar/gkh180
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
引用
收藏
页码:380 / 385
页数:6
相关论文
共 23 条
[1]   AMINO-ACID SUBSTITUTION MATRICES FROM AN INFORMATION THEORETIC PERSPECTIVE [J].
ALTSCHUL, SF .
JOURNAL OF MOLECULAR BIOLOGY, 1991, 219 (03) :555-565
[2]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[3]  
[Anonymous], 1978, Atlas of protein sequence and structure
[4]  
Durbin R., 1998, BIOL SEQUENCE ANAL
[5]   AMINO-ACID SUBSTITUTION MATRICES FROM PROTEIN BLOCKS [J].
HENIKOFF, S ;
HENIKOFF, JG .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1992, 89 (22) :10915-10919
[6]   Touring protein fold space with Dali/FSSP [J].
Holm, L ;
Sander, C .
NUCLEIC ACIDS RESEARCH, 1998, 26 (01) :316-319
[7]   THE RAPID GENERATION OF MUTATION DATA MATRICES FROM PROTEIN SEQUENCES [J].
JONES, DT ;
TAYLOR, WR ;
THORNTON, JM .
COMPUTER APPLICATIONS IN THE BIOSCIENCES, 1992, 8 (03) :275-282
[8]   MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform [J].
Katoh, K ;
Misawa, K ;
Kuma, K ;
Miyata, T .
NUCLEIC ACIDS RESEARCH, 2002, 30 (14) :3059-3066
[9]  
Kimura Motoo., 1985, The Neutral Theory of Molecular Evolution
[10]   Reduction of protein sequence complexity by residue grouping [J].
Li, TP ;
Fan, K ;
Wang, J ;
Wang, W .
PROTEIN ENGINEERING, 2003, 16 (05) :323-330