UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

Cited by: 1073
Authors
Suzek, Baris E. [1, 2]
Wang, Yuqi [3, 4]
Huang, Hongzhan [3, 4]
McGarvey, Peter B. [1]
Wu, Cathy H. [1, 3, 4]
Affiliations
[1] Georgetown Univ, Med Ctr, Prot Informat Resource, Washington, DC 20007 USA
[2] Mugla Sitki Kocman Univ, Dept Comp Engn, TR-48000 Mugla, Turkey
[3] Univ Delaware, Ctr Bioinformat & Computat Biol, Newark, DE 19711 USA
[4] Univ Delaware, Prot Informat Resource, Newark, DE 19711 USA
[5] Ctr Med Univ Geneva, Swiss Inst Bioinformat, CH-1211 Geneva 4, Switzerland
Funding
U.S. National Institutes of Health (NIH)
Keywords
PROTEIN; PREDICTION; ALIGNMENT; FAMILY;
DOI
10.1093/bioinformatics/btu739
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are used for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters.
Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. The results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50, followed by expansion of the hit lists with cluster members, demonstrated advantages compared with searches against UniProtKB sequences: the searches are concise (~7 times shorter hit list before expansion), faster (~6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
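The search strategy described in the abstract (search the cluster representatives, then expand each hit with its cluster members, then measure recall) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `expand_hits` and `recall`, the identifiers, and the in-memory `cluster_members` mapping are hypothetical, and the hit list is assumed to have already been produced by a BLASTP run against UniRef50.

```python
from typing import Dict, Iterable, List, Set


def expand_hits(hits: Iterable[str],
                cluster_members: Dict[str, List[str]]) -> List[str]:
    """Expand representative hits into their cluster members.

    hits: cluster identifiers in hit-list order (hypothetical IDs).
    cluster_members: maps a cluster identifier to its member accessions.
    A hit with no entry in the mapping is kept as-is.
    """
    expanded: List[str] = []
    seen: Set[str] = set()
    for cluster_id in hits:
        for member in cluster_members.get(cluster_id, [cluster_id]):
            if member not in seen:  # keep the first occurrence only
                seen.add(member)
                expanded.append(member)
    return expanded


def recall(found: Iterable[str], expected: Set[str]) -> float:
    """Fraction of expected sequences present in the found list."""
    if not expected:
        raise ValueError("expected set must not be empty")
    return len(set(found) & expected) / len(expected)


# Toy data, invented for illustration only.
toy_clusters = {
    "UniRef50_A": ["P1", "P2"],
    "UniRef50_B": ["P2", "P3"],
}
toy_hits = ["UniRef50_A", "UniRef50_B"]
toy_expanded = expand_hits(toy_hits, toy_clusters)
toy_recall = recall(toy_expanded, {"P1", "P3", "P4"})
```

In this sketch the short hit list (two clusters) stands in for the concise pre-expansion result, and the expanded list is what would be compared against a direct UniProtKB search when estimating recall.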
Pages: 926-932
Number of pages: 7