Fast Statistical Alignment

Cited by: 232
Authors
Bradley, Robert K. [1, 2]
Roberts, Adam [3]
Smoot, Michael [4]
Juvekar, Sudeep [3]
Do, Jaeyoung [5]
Dewey, Colin [5, 6]
Holmes, Ian [7]
Pachter, Lior [1, 2]
Affiliations
[1] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Mol & Cellular Biol, Berkeley, CA 94720 USA
[3] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[4] Univ Calif San Diego, Dept Bioengn, San Diego, CA 92103 USA
[5] Univ Wisconsin, Dept Comp Sci, Madison, WI 53706 USA
[6] Univ Wisconsin, Dept Biostat & Med Informat, Madison, WI USA
[7] Univ Calif Berkeley, Dept Bioengn, Berkeley, CA 94720 USA
Keywords
MULTIPLE SEQUENCE ALIGNMENT; HIDDEN MARKOV-MODELS; GENOMIC SEQUENCES; CLUSTAL-W; DNA; PHYLOGENY; BENCHMARK; EVOLUTION; DATABASE; GENES;
DOI
10.1371/journal.pcbi.1000392
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs which use computationally expensive Markov chain Monte Carlo approaches), yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.
Pages: 15