A unified statistical framework for sequence comparison and structure comparison

Cited by: 232
Authors
Levitt, M [1]
Gerstein, M [2]
Affiliations
[1] Stanford Univ, Dept Biol Struct, Stanford, CA 94305 USA
[2] Yale Univ, Dept Mol Biophys & Biochem, New Haven, CT 06520 USA
Keywords
sequence analysis; structure analysis; fold family; database statistics; protein evolution
DOI
10.1073/pnas.95.11.5913
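Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject classification codes
07; 0710; 09
Abstract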
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (SCOP) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. Moreover, the agreement between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.
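The core step described in the abstract, fitting an extreme-value distribution to comparison scores and converting a score into a P value, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`fit_gumbel`, `p_value`, `sample_gumbel`) are hypothetical, the method-of-moments fit is a simple stand-in rather than the fitting procedure used in the paper, and the scores are synthetic rather than taken from SCOP.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel (type I extreme-value) distribution.

    Assumption: this simple estimator stands in for whatever fitting
    procedure the paper actually uses.
    Returns (loc, scale).
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    scale = std * math.sqrt(6) / math.pi   # Gumbel: variance = (pi * scale)^2 / 6
    loc = mean - EULER_GAMMA * scale       # Gumbel: mean = loc + gamma * scale
    return loc, scale


def p_value(score, loc, scale):
    """Probability that a chance score is at least as good as `score`."""
    z = (score - loc) / scale
    if z < -700:                           # avoid overflow in exp(-z)
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny P values
    return -math.expm1(-math.exp(-z))


def sample_gumbel(rng, loc, scale):
    """Draw one synthetic score by inverting the Gumbel CDF."""
    u = max(rng.random(), 1e-300)          # keep u strictly inside (0, 1)
    return loc - scale * math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for the scores of an all-vs.-all comparison.
    background = [sample_gumbel(rng, 20.0, 5.0) for _ in range(5000)]
    loc, scale = fit_gumbel(background)
    print(f"fitted loc={loc:.2f}, scale={scale:.2f}")
    for s in (25.0, 40.0, 60.0):
        print(f"score {s:5.1f} -> P = {p_value(s, loc, scale):.3g}")
```

Because the same two functions apply to any score that follows an extreme-value distribution, the sketch reflects the abstract's point that sequence scores and structural alignment scores can share one statistical formalism.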
Pages: 5913-5920
Page count: 8