Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics

Cited by: 25

Authors:

Bastien, O

Aude, JC

Roy, S

Maréchal, E

Affiliations:

[1] Univ Grenoble 1, Physiol Cellulaire Vegetale Lab, Dept Reponse & Dynam Cellulaire, UMR 5168, CNRS, CEA, INRA, CEA Grenoble, F-38054 Grenoble 09, France

[2] Gene IT, F-92500 Rueil Malmaison, France

[3] CEA Saclay, Lab Bioinformat Genom & Modelisat, Dept Biol Joliot Curie, F-91191 Gif Sur Yvette, France

[4] CEA Grenoble, Serv Dev Bioinformat Sud Est, F-38054 Grenoble 09, France

Source:

BIOINFORMATICS | 2004 / Vol. 20 / Issue 04

Keywords:

DOI:

10.1093/bioinformatics/btg440

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: Different automatic methods of sequence alignment are routinely used as a starting point for homology searches and function inference. Confidence in an alignment probability is one of the major fundamentals of massive automatic genome-scale pairwise comparisons, for clustering of putative orthologs and paralogs, sequenced genome annotation or multiple-genomic tree constructions. Extreme value distributions based on the Karlin-Altschul model, usually advised for large-scale comparisons, are not always valid, particularly in the case of comparisons of non-biased with nucleotide-biased genomes (such as that of Plasmodium falciparum). Z-value estimates based on Monte Carlo techniques can be calculated experimentally for any alignment output, whatever the method used. Empirically, a Z-value higher than ~8 is considered a reasonable threshold for judging an alignment score significant, but this arbitrary figure was never theoretically justified.

Results: In this paper, we used the Bienaymé-Chebyshev inequality to demonstrate a theorem of the upper limit of an alignment score probability (or P-value). This theorem implies that a computed Z-value is a statistical test, a single-linkage clustering criterion and that 1/Z-value^2 is an upper limit to the probability of an alignment score, whatever the actual probability law is. Therefore, this study provides the missing theoretical link between a Z-value cut-off used for an automatic clustering of putative orthologs and/or paralogs, and the corresponding statistical risk in such genome-scale comparisons (using non-biased or biased genomes).
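As an illustration of the two quantities named in the abstract, the sketch below estimates a Monte Carlo Z-value and evaluates the Bienaymé-Chebyshev bound 1/Z^2. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy Smith-Waterman scoring (match +2, mismatch -1, linear gap -2), the choice to shuffle only the second sequence, the number of shuffles, and the function names `local_alignment_score`, `z_value` and `p_value_upper_bound` are all hypothetical choices made for this example. A real protein comparison would use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman local alignment score (illustrative scoring only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def z_value(a, b, n_shuffles=200, seed=0):
    """Monte Carlo Z-value: (observed - mean) / sd over shuffled versions of b.

    Assumption: only sequence b is shuffled, which preserves its composition.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.fmean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance; Z-value undefined")
    return (observed - mean) / sd


def p_value_upper_bound(z):
    """Distribution-free bound 1/Z^2 on the score probability, capped at 1."""
    if z <= 1:
        return 1.0  # the bound is uninformative for Z <= 1
    return 1.0 / (z * z)


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    z = z_value(seq, seq)
    print(f"Z-value: {z:.2f}, P-value upper bound: {p_value_upper_bound(z):.4g}")
    print(f"Bound at the empirical cut-off Z = 8: {p_value_upper_bound(8)}")
```

For the empirical cut-off mentioned above, Z = 8 gives a bound of 1/64, about 0.016, and this holds whatever the underlying score distribution is. The bound is conservative: a specific distribution, where one can be justified, would generally give a smaller probability.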


Pages: 534-537

Page count: 4
