Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation

Cited by: 66
Authors
Chen, Chuming [1]
Natale, Darren A. [2]
Finn, Robert D. [3]
Huang, Hongzhan [1]
Zhang, Jian [2]
Wu, Cathy H. [1, 2]
Mazumder, Raja [2]
Affiliations
[1] Univ Delaware, Ctr Bioinformat & Computat Biol, Newark, DE 19716 USA
[2] Georgetown Univ, Med Ctr, Dept Biochem & Mol & Cellular Biol, Washington, DC 20007 USA
[3] Howard Hughes Med Inst, Ashburn, VA USA
Source
PLOS ONE | 2011 / Volume 6 / Issue 04
Funding
National Institutes of Health (USA);
Keywords
TAXONOMY;
DOI
10.1371/journal.pone.0018910
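Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;
Abstract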
The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes, where similarity is calculated from co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at co-membership thresholds (CMTs) of 75%, 55%, 35% and 15% are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT is also available for text searches. Potential applications include sequence similarity searches, protein classification, and targeted protein annotation and characterization.
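The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how co-membership is computed or how the representative is chosen, so the `co_membership` formula (shared clusters as a percentage of the smaller proteome's clusters), the greedy seeding in `group_proteomes`, and the choice of the largest proteome as group seed are all assumptions made for illustration. Each proteome is represented as a set of hypothetical UniRef50 cluster identifiers.

```python
from typing import Dict, List, Set


def co_membership(a: Set[str], b: Set[str]) -> float:
    """Assumed measure: shared clusters as a percentage of the smaller proteome's clusters."""
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


def group_proteomes(clusters: Dict[str, Set[str]], cmt: float) -> List[List[str]]:
    """Greedy grouping (an assumption, not the published procedure).

    Proteomes are visited from largest to smallest cluster set. Each one joins
    the first group whose seed (first member) it matches at >= cmt percent,
    otherwise it starts a new group and becomes that group's seed.
    """
    groups: List[List[str]] = []
    for name in sorted(clusters, key=lambda n: (-len(clusters[n]), n)):
        for group in groups:
            if co_membership(clusters[name], clusters[group[0]]) >= cmt:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical toy data: proteome name -> set of cluster identifiers.
toy = {
    "A": {"c1", "c2", "c3", "c4"},
    "B": {"c1", "c6", "c7"},
    "C": {"c8", "c9"},
}
coarse = group_proteomes(toy, cmt=15.0)  # lower threshold, fewer and larger groups
fine = group_proteomes(toy, cmt=55.0)    # higher threshold, more and smaller groups
```

In this toy data, lowering the threshold from 55 to 15 merges A and B into one group, which mirrors the abstract's point that the CMT controls the granularity of the resulting set.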
Pages: 9