UniRef: comprehensive and non-redundant UniProt reference clusters

Cited by: 1007
Authors
Suzek, Baris E. [1]
Huang, Hongzhan [1]
McGarvey, Peter [1]
Mazumder, Raja [1]
Wu, Cathy H. [1]
Affiliations
[1] Georgetown Univ, Med Ctr, Prot Informat Resource, Dept Biochem Mol & Cell Biol, Washington, DC 20007 USA
DOI
10.1093/bioinformatics/btm098
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences.
Results: The UniRef (UniProt Reference Clusters) databases provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90% or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis.
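The merging step the abstract describes for UniRef100 (identical sequences and subfragments collapsed into one entry) can be illustrated with a minimal Python sketch. This is not the UniRef production pipeline, which the abstract does not describe: the function name, the record layout, and the toy accessions and sequences are all hypothetical, and the quadratic substring scan is only practical for small inputs.

```python
def cluster_identical_and_subfragments(records):
    """Group sequences that are identical to, or exact subfragments of,
    a longer representative. `records` maps accession -> sequence.
    Illustrative only; names and data layout are assumptions."""
    # Visit the longest sequences first so that a representative is
    # always registered before any of its subfragments.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters = []
    for accession, sequence in ordered:
        for cluster in clusters:
            # Exact substring match covers both identical sequences
            # and subfragments of the representative.
            if sequence in cluster["sequence"]:
                cluster["members"].append(accession)
                break
        else:
            # No existing representative contains this sequence.
            clusters.append({
                "representative": accession,
                "sequence": sequence,
                "members": [accession],
            })
    return clusters


# Hypothetical toy input: P2 duplicates P1, P3 is a fragment of P1.
toy = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQR",
    "P3": "TAYIAK",
    "P4": "MSLNEQW",
}
clusters = cluster_identical_and_subfragments(toy)

# Fractional size reduction relative to the source sequence set.
reduction = 1 - len(clusters) / len(toy)
```

Each resulting cluster keeps the kind of summary information the abstract lists: a representative sequence, the member accessions, and (via `len(cluster["members"])`) a member count. Clustering at 90% or 50% identity would need an alignment-based similarity measure in place of the exact substring test.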
Pages: 1282-1288
Page count: 7