Sequence embedding for fast construction of guide trees for multiple sequence alignment

Cited by: 70
Authors
Blackshields, Gordon [1]
Sievers, Fabian [1]
Shi, Weifeng [1]
Wilm, Andreas [1]
Higgins, Desmond G. [1]
Affiliations
[1] Univ Coll Dublin, UCD Conway Inst Biomol & Biomed Sci, Dublin 4, Ireland
Funding
Science Foundation Ireland;
Keywords
CLUSTAL-W; DATABASE; MAFFT; ACID;
DOI
10.1186/1748-7188-5-21
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
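The general idea in the abstract, embedding sequences so that only a fraction of the pairwise distances need to be computed, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`kmer_distance`, `embed`, `euclidean`), the toy k-mer distance, the random choice of seed sequences, and the `n_seeds` parameter are all assumptions made for illustration. Each sequence is represented by its distances to a small set of seed sequences, so the expensive distance is evaluated N × n_seeds times instead of for all N² pairs.

```python
import random
from collections import Counter
from math import sqrt


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """Toy distance (assumption): 1 minus the fraction of shared k-mers.

    Stands in for whatever expensive sequence distance is actually used.
    """
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())          # multiset intersection size
    denom = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / denom if denom else 1.0


def embed(seqs, n_seeds, rng=None):
    """Map each sequence to a vector of distances to n_seeds seed sequences.

    Seeds are picked at random here (assumption); this costs
    len(seqs) * n_seeds distance evaluations rather than len(seqs) ** 2.
    """
    rng = rng or random.Random(0)
    seeds = rng.sample(seqs, n_seeds)
    return [[kmer_distance(s, seed) for seed in seeds] for s in seqs]


def euclidean(u, v):
    """Cheap distance between two embedded vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

In a sketch like this, the embedded vectors would then be passed to an ordinary vector clustering method to produce a guide tree, with `euclidean` serving as a cheap approximation of the original sequence distance.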
Pages: 11