DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment

被引:170
作者
Subramanian, Amarendran R. [1 ]
Kaufmann, Michael [1 ]
Morgenstern, Burkhard [2 ]
机构
[1] Univ Tubingen, Wilhelm Schickard Inst Informat, D-72076 Tubingen, Germany
[2] Univ Gottingen, Inst Microbiol & Genet, D-37077 Gottingen, Germany
关键词
D O I
10.1186/1748-7188-3-6
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a a straight-forward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach. Results: Our new heuristic produces significantly better alignments, especially on globally related sequences, without increasing the CPU time and memory consumption exceedingly. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences For protein benchmarks we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 for assessing the quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRM-BASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motives at random positions in long unalignable sequences. Conclusion: On BALIBASE 3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment like MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRM-BASE 1, our method outperforms all other programs while MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX.
引用
收藏
页数:11
相关论文
共 40 条
[1]  
ABDEDDAIM S, 2001, LECT NOTES COMPUTER, V2066, P1
[2]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[3]   The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences [J].
Brudno, M ;
Steinkamp, R ;
Morgenstern, B .
NUCLEIC ACIDS RESEARCH, 2004, 32 :W41-W44
[4]   Fast and sensitive multiple alignment of large genomic sequences -: art. no. 66 [J].
Brudno, M ;
Chapman, M ;
Göttgens, B ;
Batzoglou, S ;
Morgenstern, B .
BMC BIOINFORMATICS, 2003, 4 (1)
[5]   A MODIFICATION OF THE GREEDY ALGORITHM FOR VERTEX COVER [J].
CLARKSON, KL .
INFORMATION PROCESSING LETTERS, 1983, 16 (01) :23-25
[6]  
COREL E, 2007, 1 INT S OPT SYST BIO, P189
[7]   MULTIPLE SEQUENCE ALIGNMENT WITH HIERARCHICAL-CLUSTERING [J].
CORPET, F .
NUCLEIC ACIDS RESEARCH, 1988, 16 (22) :10881-10890
[8]   Local decoding of sequences and alignment-free comparison [J].
Didier, Gilles ;
Laprevotte, Ivan ;
Pupin, Maude ;
Henaut, Alain .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2006, 13 (08) :1465-1476
[9]   ProbCons: Probabilistic consistency-based multiple sequence alignment [J].
Do, CB ;
Mahabhashyam, MSP ;
Brudno, M ;
Batzoglou, S .
GENOME RESEARCH, 2005, 15 (02) :330-340
[10]   MUSCLE: multiple sequence alignment with high accuracy and high throughput [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (05) :1792-1797