Toward an accurate statistics of gapped alignments

Cited by: 13
Authors
Kschischo, M [1]
Lässig, M
Yu, YK
Affiliations
[1] Natl Lib Med, Natl Ctr Biotechnol Informat, NIH, Bethesda, MD 20894 USA
[2] Univ Appl Sci Koblenz, D-53424 Remagen, Germany
[3] Univ Cologne, Inst Theoret Phys, D-50937 Cologne, Germany
[4] Florida Atlantic Univ, Dept Phys, Boca Raton, FL 33431 USA
DOI
10.1016/j.bulm.2004.07.001
Chinese Library Classification number
Q [Biological Sciences];
Subject classification codes
07 ; 0710 ; 09 ;
Abstract
Sequence alignment has been an invaluable tool for finding homologous sequences. The significance of the homology found is often quantified statistically by p-values. Theory for computing p-values exists for gapless alignments [Karlin, S., Altschul, S.F., 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264-2268; Karlin, S., Dembo, A., 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113-140], but a full generalization to alignments with gaps is not yet complete. We present a unified statistical analysis of two common sequence comparison algorithms: maximum-score (Smith-Waterman) alignments and their generalized probabilistic counterparts, including maximum-likelihood alignments and hidden Markov models. The most important statistical characteristic of these algorithms is the distribution function of the maximum score S_max or, respectively, the maximum free energy F_max, for mutually uncorrelated random sequences. This distribution is known empirically to be of the Gumbel form with an exponential tail, P(S_max > x) ~ exp(-λx) for maximum-score alignment and P(F_max > x) ~ exp(-λx) for some classes of probabilistic alignment. We derive an exact expression for λ for particular probabilistic alignments. This result is then used to obtain accurate λ values for generic probabilistic and maximum-score alignments. Although the result demonstrated uses a simple match-mismatch scoring system, it is expected to be a good starting point for more general scoring functions. (C) 2004 Society for Mathematical Biology. Published by Elsevier Ltd. All rights reserved.
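The two quantities named in the abstract, the maximum local alignment score S_max and its Gumbel tail, can be illustrated with a minimal Python sketch. This is not the paper's method: it assumes a match-mismatch scoring system with a linear gap cost, and it assumes the standard Karlin-Altschul Gumbel form P(S_max > x) = 1 - exp(-K m n exp(-λx)), whose tail behaves like exp(-λx). The function names and all parameter values (match, mismatch, gap, lam, K) are hypothetical placeholders; the paper's contribution is precisely how to obtain accurate λ for gapped alignments, which this sketch does not attempt.

```python
import math


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Maximum local alignment score S_max of sequences a and b.

    Assumes match-mismatch scoring with a linear gap cost; the default
    values are illustrative placeholders, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gumbel_p_value(score, lam, K, m, n):
    """P(S_max > score) under the assumed Gumbel form.

    lam and K must be supplied by the caller; m and n are the
    sequence lengths.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected_hits)  # 1 - exp(-expected_hits), computed stably
```

With λ and K estimated for a given scoring system, `gumbel_p_value(smith_waterman_score(a, b), lam, K, len(a), len(b))` would give the significance of an observed score under this assumed form.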
Pages: 169-191
Page count: 23
Related papers
28 in total
[1]   BASIC LOCAL ALIGNMENT SEARCH TOOL [J].
ALTSCHUL, SF ;
GISH, W ;
MILLER, W ;
MYERS, EW ;
LIPMAN, DJ .
JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) :403-410
[2]   A PHASE TRANSITION FOR THE SCORE IN MATCHING RANDOM SEQUENCES ALLOWING DELETIONS [J].
Arratia, Richard ;
Waterman, Michael S. .
ANNALS OF APPLIED PROBABILITY, 1994, 4 (01) :200-225
[3]   Asymmetric exclusion process and extremal statistics of random sequences [J].
Bundschuh, R .
PHYSICAL REVIEW E, 2002, 65 (03)
[4]  
Dayhoff M.O., 1978, ATLAS PROTEIN SEQ ST, V5
[5]   Scaling laws and similarity detection in sequence alignment with gaps [J].
Drasdo, D ;
Hwa, T ;
Lässig, M .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2000, 7 (1-2) :115-141
[6]  
DRASDO D, 1998, P 6 INT C INT SYST M, P52
[7]  
Durbin R., 1998, Biological sequence analysis: Probabilistic models of proteins and nucleic acids
[8]   Profile hidden Markov models [J].
Eddy, SR .
BIOINFORMATICS, 1998, 14 (09) :755-763
[9]   DIRECTED WAVES IN RANDOM-MEDIA - AN ANALYTICAL CALCULATION [J].
FRIEDBERG, R ;
YU, YK .
PHYSICAL REVIEW E, 1994, 49 (06) :5755-5762
[10]  
Gumbel E. J., 1958, Statistics of Extremes