Homology-based method for identification of protein repeats using statistical significance estimates

Cited by: 156
Authors
Andrade, MA
Ponting, CP
Gibson, TJ
Bork, P
Affiliations
[1] European Mol Biol Lab, D-69012 Heidelberg, Germany
[2] Max Delbruck Ctr Mol Med, Dept Bioinformat, D-13092 Berlin, Germany
[3] Univ Oxford, Dept Human Anat & Genet, MRC, Funct Genet Unit, Oxford OX1 3QX, England
Keywords
protein repeats; homology; sub-optimal alignment; extreme value distribution; sequence analysis
DOI
10.1006/jmbi.2000.3684
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the eleven families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families. © 2000 Academic Press.
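As an illustration only, the idea of scoring sub-optimal alignments against an extreme value distribution can be sketched as follows. This is not the authors' implementation: the abstract does not say how the distribution is fitted, so the method-of-moments Gumbel fit, the function names (`fit_gumbel`, `gumbel_pvalue`, `significant_repeats`) and the `p_threshold` default are all assumptions made for this sketch.

```python
import math
from statistics import mean, pstdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments Gumbel fit (an assumed choice); returns (mu, lam)."""
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("background scores must not all be identical")
    lam = math.pi / (sd * math.sqrt(6))      # scale parameter
    mu = mean(scores) - EULER_GAMMA / lam    # location parameter
    return mu, lam


def gumbel_pvalue(score, mu, lam):
    """P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    z = -lam * (score - mu)
    if z > 700:  # avoid overflow in exp; the tail probability is 1 here
        return 1.0
    return -math.expm1(-math.exp(z))


def significant_repeats(candidate_scores, background_scores, p_threshold=0.01):
    """Keep candidate alignment scores whose P-value is at or below the threshold."""
    mu, lam = fit_gumbel(background_scores)
    kept = []
    for s in candidate_scores:
        p = gumbel_pvalue(s, mu, lam)
        if p <= p_threshold:
            kept.append((s, p))
    return kept
```

In this sketch, `background_scores` stands for alignment scores from non-homologous sequences and `candidate_scores` for the optimal and sub-optimal profile alignment scores found within one sequence. The iterative refinement mentioned in the abstract is not modelled here.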
Pages: 521-537
Page count: 17