Accurate formula for p-values of gapped local sequence and profile alignments

Cited by: 87
Author
Mott, R [1]
Affiliation
[1] Wellcome Trust Res Labs, Ctr Human Genet, Oxford OX3 7BN, England
Keywords
statistical significance; protein sequence; protein profile; sequence alignment
DOI
10.1006/jmbi.2000.3875
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing the significance of comparisons between two protein sequences or between a sequence and a profile. The approximation takes account of the scoring scheme (i.e. gap penalty and substitution matrix or profile), sequence composition and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of databank searches. The method is based on the theoretical ideas introduced by R. Mott and R. Tribe in 1999. Extensive simulation studies show that score thresholds produced by the method are accurate to within ±5%, 95% of the time. We also investigate factors which affect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently, it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood. © 2000 Academic Press.
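For orientation, the sketch below shows how an extreme-value (Gumbel) approximation turns an alignment score into a p-value and, inversely, a p-value into a score threshold. It uses the generic form P(S >= s) = 1 - exp(-K m n e^(-lambda s)) that is standard for local alignment statistics; it is not the formula derived in this paper, which is not reproduced in this record. The function names and the values of `lam` and `K` in the usage lines are illustrative assumptions only.

```python
import math


def evd_pvalue(score, m, n, lam, K):
    """P(S >= score) under a generic Gumbel approximation.

    m, n   : lengths of the two sequences (or sequence and profile)
    lam, K : scale and prefactor parameters; assumed to be supplied by
             the caller, not values taken from the paper.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-expected_hits)


def score_threshold(p, m, n, lam, K):
    """Score whose p-value equals p (inverse of evd_pvalue)."""
    expected_hits = -math.log1p(-p)  # solve 1 - exp(-E) = p for E
    return math.log(K * m * n / expected_hits) / lam


# Illustrative parameter values only (hypothetical, not from the paper).
threshold = score_threshold(0.01, m=300, n=300, lam=0.27, K=0.04)
p_back = evd_pvalue(threshold, m=300, n=300, lam=0.27, K=0.04)
```

In this generic form, the dependence on scoring scheme and composition enters only through `lam` and `K`, which is why they are usually fitted to simulations or databank searches; the abstract states that the paper's formula makes that fitting step unnecessary.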
Pages: 649-659
Page count: 11
Related papers
37 in total
[21] Mott, R; Tribe, R. Approximate statistics of gapped alignments. Journal of Computational Biology, 1999, 6(1): 91-112.
[22] Mott, R. B MATH BIOL, 1992, 54: 59. DOI 10.1007/BF02458620
[23] Mott, R. Local sequence alignments with monotonic gap penalties. Bioinformatics, 1999, 15(6): 455-462.
[24] Murzin, AG. J MOL BIOL, 1995, 247: 536. DOI 10.1016/S0022-2836(05)80134-2
[25] Olsen, R. 7 INT C INT SYST MOL, 1999: 303.
[26] Park, J; Karplus, K; Barrett, C; Hughey, R; Haussler, D; Hubbard, T; Chothia, C. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology, 1998, 284(4): 1201-1210.
[27] Pearson, WR. Empirical statistical estimates for sequence similarity searches. Journal of Molecular Biology, 1998, 276(1): 71-84.
[28] Pearson, WR; Lipman, DJ. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 1988, 85(8): 2444-2448.
[29] Robinson, AB; Robinson, LR. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proceedings of the National Academy of Sciences of the United States of America, 1991, 88(20): 8880-8884.
[30] Schäffer, AA. BIOINFORMATICS, 1999, 15: 1000.