Parameters for accurate genome alignment

Cited by: 154
Authors
Frith, Martin C. [1]
Hamada, Michiaki [1, 2]
Horton, Paul [1]
Affiliations
[1] Inst Adv Ind Sci & Technol, Computat Biol Res Ctr, Tokyo 1350064, Japan
[2] Mizuho Informat & Res Inst Inc, Chiyoda Ku, Tokyo 1018443, Japan
Source
BMC BIOINFORMATICS | 2010 / Volume 11
Keywords
SEQUENCE ALIGNMENT; DATABASE; UNCERTAINTY; IDENTIFY; ELEMENTS; MOUSE; BLAST
DOI
10.1186/1471-2105-11-80
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed.
Results: We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that gamma-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases.
Conclusions: These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
Pages: 14