Toward an accurate statistics of gapped alignments

Cited by: 13
Authors
Kschischo, M [1]
Lässig, M
Yu, YK
Affiliations
[1] Natl Lib Med, Natl Ctr Biotechnol Informat, NIH, Bethesda, MD 20894 USA
[2] Univ Appl Sci Koblenz, D-53424 Remagen, Germany
[3] Univ Cologne, Inst Theoret Phys, D-50937 Cologne, Germany
[4] Florida Atlantic Univ, Dept Phys, Boca Raton, FL 33431 USA
DOI
10.1016/j.bulm.2004.07.001
Chinese Library Classification number
Q [Biological sciences]
Subject classification codes
07; 0710; 09
Abstract
Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113-140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score S_max or, respectively, the maximum free energy F_max, for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form with an exponential tail P(S_max > x) ~ exp(-λx) for maximum-score alignment and P(F_max > x) ~ exp(-λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result demonstrated uses a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions. © 2004 Society for Mathematical Biology. Published by Elsevier Ltd. All rights reserved.
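
For orientation only, the gapless theory that the abstract cites as its starting point can be sketched in a few lines of Python. The sketch is not the method of the paper. It assumes uniform letter frequencies over a 4-letter alphabet and a match-mismatch score of +1/-1, and the function names `gapless_lambda` and `gumbel_pvalue`, the prefactor `K`, and the sequence lengths `m` and `n` are illustrative choices that do not appear in the record. In the gapless case λ is the positive root of p·exp(λ·match) + (1 - p)·exp(λ·mismatch) = 1, and the Gumbel form gives P(S_max > x) = 1 - exp(-K·m·n·exp(-λx)).

```python
import math


def gapless_lambda(match=1.0, mismatch=-1.0, alphabet_size=4):
    """Positive root of p*exp(lam*match) + (1-p)*exp(lam*mismatch) = 1.

    Assumes uniform, independent letters, so p = 1/alphabet_size is the
    probability that two random letters match (an illustrative assumption).
    """
    p = 1.0 / alphabet_size
    if p * match + (1.0 - p) * mismatch >= 0.0:
        raise ValueError("expected score per aligned pair must be negative")

    def f(lam):
        return p * math.exp(lam * match) + (1.0 - p) * math.exp(lam * mismatch) - 1.0

    # f(0) = 0 and f dips below zero before growing, so bracket the
    # positive root by doubling hi until f(hi) > 0.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:
        lo, hi = hi, 2.0 * hi

    # Plain bisection on the sign of f.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def gumbel_pvalue(x, lam, K, m, n):
    """P(S_max > x) under the Gumbel form 1 - exp(-K*m*n*exp(-lam*x)).

    K is a hypothetical prefactor; m and n are the sequence lengths.
    """
    return -math.expm1(-K * m * n * math.exp(-lam * x))


lam = gapless_lambda()                                  # match=+1, mismatch=-1
p_val = gumbel_pvalue(30.0, lam, K=0.1, m=300, n=300)   # K=0.1 is made up
```

For these scores the root has the closed form λ = ln 3, which is a convenient check on the bisection. For large x the p-value approaches K·m·n·exp(-λx), the exponential tail quoted in the abstract.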
Pages: 169-191
Number of pages: 23