Fast Statistical Alignment

Cited by: 232

Authors:

Bradley, Robert K. ^{[1,2]}

Roberts, Adam ^{[3]}

Smoot, Michael ^{[4]}

Juvekar, Sudeep ^{[3]}

Do, Jaeyoung ^{[5]}

Dewey, Colin ^{[5,6]}

Holmes, Ian ^{[7]}

Pachter, Lior ^{[1,2]}

Affiliations:

[1] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA

[2] Univ Calif Berkeley, Dept Mol & Cellular Biol, Berkeley, CA 94720 USA

[3] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA

[4] Univ Calif San Diego, Dept Bioengn, San Diego, CA 92103 USA

[5] Univ Wisconsin, Dept Comp Sci, Madison, WI 53706 USA

[6] Univ Wisconsin, Dept Biostat & Med Informat, Madison, WI USA

[7] Univ Calif Berkeley, Dept Bioengn, Berkeley, CA 94720 USA

Source:

PLOS COMPUTATIONAL BIOLOGY | 2009 / Vol. 5 / Issue 05

Keywords:

MULTIPLE SEQUENCE ALIGNMENT; HIDDEN MARKOV-MODELS; GENOMIC SEQUENCES; CLUSTAL-W; DNA; PHYLOGENY; BENCHMARK; EVOLUTION; DATABASE; GENES;

DOI:

10.1371/journal.pcbi.1000392

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models, which approximate an insertion/deletion process on a tree, and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments that are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs that use computationally expensive Markov chain Monte Carlo approaches), yet it can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation, which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.

Pages: 15
