Minimizing proteome redundancy in the UniProt Knowledgebase

Cited by: 18
Authors
Bursteinas, Borisas [1]
Britto, Ramona [1]
Bely, Benoit [1]
Auchincloss, Andrea [2]
Rivoire, Catherine [2]
Redaschi, Nicole [2]
O'Donovan, Claire [1]
Martin, Maria Jesus [1]
Affiliations
[1] EBI, EMBL, Wellcome Trust Genome Campus, Cambridge CB10 1SD, England
[2] Ctr Med Univ Geneva, SIB, 1 Rue Michel Servet, CH-1211 Geneva 4, Switzerland
Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2016
Funding
National Institutes of Health (USA);
DOI
10.1093/database/baw139
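Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
090105 [Crop Production Systems and Ecological Engineering];
Abstract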
Advances in high-throughput sequencing have led to unprecedented growth in the number of genome sequences submitted to biological databases. In particular, the sequencing of large numbers of nearly identical bacterial genomes during infection outbreaks and for other large-scale studies has resulted in a high level of redundancy in nucleotide databases and consequently in the UniProt Knowledgebase (UniProtKB). Redundancy negatively impacts database searches by causing slower searches, an increase in statistical bias and cumbersome result analysis. The redundancy combined with the large data volume increases the computational costs for most reuses of UniProtKB data. All of this poses challenges for effective discovery in this wealth of data. With the continuing development of sequencing technologies, it is clear that finding ways to minimize redundancy is crucial to maintaining UniProt's essential contribution to data interpretation by our users. We have developed a methodology to identify and remove highly redundant proteomes from UniProtKB. The procedure identifies redundant proteomes by performing pairwise alignments of sets of sequences for pairs of proteomes and subsequently applies graph theory to find dominating sets that provide a set of non-redundant proteomes with a minimal loss of information. This method was implemented for bacteria in mid-2015, resulting in the removal of 50 million proteins from UniProtKB. With every new release, this procedure is used to filter new incoming proteomes, resulting in a more scalable and scientifically valuable growth of UniProtKB.
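The dominating-set step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that pairwise proteome comparison has already produced an undirected graph whose vertices are proteomes and whose edges join pairs judged redundant under some similarity threshold. The abstract does not state which algorithm is used, so the greedy heuristic below is a generic stand-in rather than the authors' method (finding a minimum dominating set is NP-hard in general). The proteome identifiers and the function name `greedy_dominating_set` are hypothetical.

```python
from typing import Dict, Set


def greedy_dominating_set(graph: Dict[str, Set[str]]) -> Set[str]:
    """Return a dominating set (not necessarily minimum) of an undirected graph.

    `graph` maps each vertex to the set of its neighbours. Every vertex ends up
    either in the returned set or adjacent to a member of it.
    """
    undominated = set(graph)
    chosen: Set[str] = set()
    while undominated:
        # Pick the vertex that covers the most still-undominated vertices.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(graph),
            key=lambda v: len(({v} | graph[v]) & undominated),
        )
        chosen.add(best)
        undominated -= {best} | graph[best]
    return chosen


# Hypothetical proteome IDs; an edge means the pair exceeded an assumed
# similarity threshold in the pairwise comparison.
similar = {
    "P1": {"P2", "P3"},
    "P2": {"P1"},
    "P3": {"P1"},
    "P4": {"P5"},
    "P5": {"P4"},
    "P6": set(),
}

kept = greedy_dominating_set(similar)
print(sorted(kept))
```

In this toy graph the kept proteomes stand in for their redundant neighbours, while a proteome with no similar partner (here `P6`) is always retained.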
Pages: 1-9
Page count: 9