SUPFAM - a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes

Cited by: 36
Authors
Pandit, SB
Gosar, D
Abhiman, S
Sujatha, S
Dixit, SS
Mhatre, NS
Sowdhamini, R
Srinivasan, N [1]
Affiliations
[1] Indian Inst Sci, Mol Biophys Unit, Bangalore 560012, Karnataka, India
[2] Tata Inst Fundamental Res, Natl Ctr Biol Sci, Bangalore 560065, Karnataka, India
[3] Indian Inst Technol, Ctr Biotechnol, Powai 400076, Mumbai, India
Keywords
DOI
10.1093/nar/30.1.289
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010; 081704;
Abstract
Members of a protein superfamily can arise from divergent evolution of homologues whose amino acid sequences retain no significant similarity. A superfamily relationship is commonly detected only after the three-dimensional structures of the proteins have been determined by X-ray analysis or NMR. The SUPFAM database described here relates two homologous protein families in a multiple sequence alignment database of either known or unknown structure. The present release (1.1), which is the first version of the SUPFAM database, has been derived by analysing Pfam, one of the commonly used databases of multiple sequence alignments of homologous proteins. The first step in establishing SUPFAM is to relate Pfam families to the families in PALI, an alignment database of homologous proteins of known structure that is derived largely from SCOP. The second step involves relating to one another the Pfam families which could not be associated reliably with a protein superfamily of known structure. The profile matching procedure IMPALA has been used in both steps. The first step identified 1280 Pfam families (out of 2697, i.e. 47%) which are related to a SCOP family, either by close homologous connection or by distant relationship, potentially forming new superfamily connections. Using the profiles of the remaining 1417 Pfam families with apparently no structural information, an all-against-all comparison involving sequence-profile matching with IMPALA clustered 67 homologous protein families of Pfam into 28 potential new superfamilies. Expansion of groups of related proteins of as yet unknown structure, as proposed in SUPFAM, should help in identifying 'priority proteins' for structure determination in structural genomics initiatives aiming to expand the coverage of structural information in protein sequence space.
For example, we could assign 858 distinct Pfam domains in 2203 of the gene products in the genome of Mycobacterium tuberculosis. Fifty-one of these Pfam families of unknown structure could be clustered into 17 potentially new superfamilies, forming good targets for structural genomics. The SUPFAM database can be accessed at http://pauling.mbu.iisc.ernet.in/~supfam.
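The clustering step in the abstract (grouping families of unknown structure that are linked by sequence-profile matches) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how clusters were formed from the IMPALA matches, so single-linkage grouping with a union-find structure is an assumption, as are the function name `cluster_families`, the `e_value_cutoff` value, and the family identifiers in the example.

```python
from collections import defaultdict


def cluster_families(hits, e_value_cutoff=1e-3):
    """Group family IDs linked by profile hits at or below a cutoff.

    hits: iterable of (query_family, target_family, e_value) tuples,
          e.g. parsed from an all-against-all profile search (hypothetical input).
    e_value_cutoff: illustrative threshold; the abstract does not state one.
    Returns a sorted list of clusters, each a sorted list of family IDs.
    Families with no accepted hit to another family are omitted.
    """
    parent = {}

    def find(x):
        # Find the cluster representative, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, target, e_value in hits:
        # Skip self-matches and hits weaker than the cutoff.
        if query != target and e_value <= e_value_cutoff:
            union(query, target)

    groups = defaultdict(set)
    for family in list(parent):
        groups[find(family)].add(family)

    return sorted(sorted(members) for members in groups.values() if len(members) > 1)


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    example_hits = [
        ("FAM_A", "FAM_B", 1e-5),
        ("FAM_B", "FAM_C", 1e-4),
        ("FAM_D", "FAM_E", 1e-6),
        ("FAM_F", "FAM_A", 0.5),   # too weak, ignored
        ("FAM_G", "FAM_G", 1e-30), # self-match, ignored
    ]
    for cluster in cluster_families(example_hits):
        print(cluster)
```

Single linkage is the most permissive choice: one accepted match is enough to merge two groups, so a stricter rule (for example, requiring reciprocal matches) would yield fewer and smaller candidate superfamilies.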
Pages: 289-293
Page count: 5