The properties of protein family space depend on experimental design

Cited by: 10
Authors
Kunin, V
Teichmann, SA
Huynen, MA
Ouzounis, CA
Affiliations
[1] European Bioinformatics Institute, EMBL Cambridge Outstation, Computational Genomics Group, Cambridge CB10 1SD, England
[2] MRC Laboratory of Molecular Biology, Cambridge CB2 2QH, England
[3] University of Nijmegen, Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Nijmegen, Netherlands
DOI
10.1093/bioinformatics/bti386
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Databases of protein families often exhibit drastically different properties of the protein family space.

Results: We compared the properties of protein family space as reflected by exhaustive protein family databases and databases with predefined families. We used TRIBES, Protomap, ProDom and COGs as representatives of the exhaustive databases, and Pfam-A and Superfamily as databases that predefine families. We observe a power-law distribution of family sizes in all these databases, albeit in predefined databases the power-law line collapses before reaching smaller sized families. We discuss the future trends of this power-law distribution and suggest that saturation in the sampling of protein family space will result in a distortion of the power law in small family sizes. For larger genome sizes, predefined databases show logarithmic growth of the number of families per genome, whereas exhaustive databases exhibit a virtually linear relationship. All databases consistently differ in the proportion of protein families shared between taxa. Predefined databases have a larger number of protein families shared between the three domains of life, while exhaustive databases show a much more fragmented distribution. We argue that these discrepancies reflect alternative approaches to the trade-off issue of sensitivity versus specificity in the detection of homologous proteins. We conclude that these properties are complementary rather than contradictory, while describing the protein universe from different perspectives.
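
The power-law claim in the abstract can be illustrated with a minimal sketch. This is not the authors' method: it assumes the input is simply a list of family sizes (one integer per family) and estimates the exponent with an ordinary least-squares fit on log-log axes. The function names `family_size_histogram` and `fit_power_law_exponent` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def family_size_histogram(family_sizes):
    """Map each family size to the number of families of that size."""
    return Counter(family_sizes)


def fit_power_law_exponent(histogram):
    """Fit count ~ C * size**(-gamma) by least squares on log-log axes.

    Returns (gamma, log_C). Assumes at least two distinct sizes,
    all sizes >= 1 and all counts >= 1.
    """
    xs = [math.log(size) for size in histogram]
    ys = [math.log(count) for count in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # A decaying power law has a negative slope; report gamma as positive.
    return -slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Synthetic sizes whose counts follow 1000 * size**-2 (illustrative data only).
    sizes = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
    gamma, log_c = fit_power_law_exponent(family_size_histogram(sizes))
    print(f"estimated exponent: {gamma:.3f}")
```

A log-log regression is a crude estimator and is sensitive to sparsely populated large sizes. A deviation from the fitted line at small family sizes, as the abstract describes for predefined databases, would show up as points falling below the line near size 1.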
Pages: 2618-2622
Number of pages: 5