Significance of Z-value statistics of Smith-Waterman scores for protein alignments

Cited by: 52
Authors
Comet, JP
Aude, JC
Glémet, E
Risler, JL
Hénaut, A
Slonimski, PP
Codani, JJ
Affiliations
[1] Inst Natl Rech Informat & Automat, F-78153 Le Chesnay, France
[2] CNRS, Ctr Genet Mol, F-91198 Gif Sur Yvette, France
[3] Univ Versailles, F-78035 Versailles, France
Source
COMPUTERS & CHEMISTRY | 1999, Vol. 23, Issue 3-4
Keywords
sequence alignment; dynamic programming; significance; Z-value; Gumbel distribution; Pareto distribution
DOI
10.1016/S0097-8485(99)00008-X
Chinese Library Classification
O6 [Chemistry]
Subject classification code
0703
Abstract
The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) through the use of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study of the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a probability law that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimate of its variance, and the number of randomizations in the Monte-Carlo process. Second, we illustrate that Z-values are less correlated with sequence lengths than SW-scores are. Third, we show that pairwise alignments performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones) lead to Z-value distributions that statistically fit the extreme value distribution (EVD), more precisely the Gumbel distribution (the global EVD). However, for real protein sequences, we observe an over-representation of high Z-values. We first determine a cutoff value that separates these overestimated Z-values from those that follow the global EVD. We then show that the interesting part of the tail of the Z-value distribution can be approximated by another EVD (i.e., an EVD that differs from the global EVD) or by a Pareto law. This has been confirmed for all proteins analysed so far, whether extracted from individual genomes or from the ensemble of five complete microbial genomes comprising altogether 16956 protein sequences. © 1999 Elsevier Science Ltd. All rights reserved.
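The Monte-Carlo estimate described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual definition of a Z-value, (observed score minus the mean of the shuffled scores) divided by their standard deviation, and it shuffles only the second sequence, which preserves its length and amino acid composition. The scoring scheme (`match`, `mismatch`, and a linear `gap` penalty) and the function names `sw_score` and `z_value` are hypothetical placeholders; this record does not state the substitution matrix, gap penalties, or shuffling protocol used in the paper.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring parameters are illustrative assumptions, not the paper's.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-value of the SW-score of a against b.

    Assumption: only b is shuffled; each shuffle keeps its length and
    composition, as for the 'quasi-real' sequences in the abstract.
    """
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(a, letters))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd


if __name__ == "__main__":
    # Arbitrary toy sequence, used only to exercise the functions.
    seq = "MKTAYIAKQRQISFVKSHFSRQ"
    print(z_value(seq, seq, n_shuffles=200))
```

Because the estimate depends on `n_shuffles`, repeating the call with different seeds gives a rough idea of its spread, which is the kind of relationship between Z-value, variance, and number of randomizations that the abstract mentions.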
Pages: 317-331
Page count: 15