UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

Cited by: 1073
Authors
Suzek, Baris E. [1, 2]
Wang, Yuqi [3, 4]
Huang, Hongzhan [3, 4]
McGarvey, Peter B. [1]
Wu, Cathy H. [1, 3, 4]
Affiliations
[1] Georgetown Univ, Med Ctr, Prot Informat Resource, Washington, DC 20007 USA
[2] Mugla Sitki Kocman Univ, Dept Comp Engn, TR-48000 Mugla, Turkey
[3] Univ Delaware, Ctr Bioinformat & Computat Biol, Newark, DE 19711 USA
[4] Univ Delaware, Prot Informat Resource, Newark, DE 19711 USA
[5] Ctr Med Univ Geneva, Swiss Inst Bioinformat, CH-1211 Geneva 4, Switzerland
Funding
National Institutes of Health (USA);
Keywords
PROTEIN; PREDICTION; ALIGNMENT; FAMILY;
DOI
10.1093/bioinformatics/btu739
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are more concise (approximately 7 times shorter hit list before expansion), faster (approximately 6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
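The search strategy described in the abstract (search against cluster representatives, then expand the hit list with cluster members and compare it with a direct search) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `expand_hits` and `recall`, the cluster map, and all accessions are hypothetical placeholders, and the hit lists are assumed to have been produced beforehand by a search tool such as BLASTP.

```python
from typing import Dict, Iterable, List, Set


def expand_hits(rep_hits: Iterable[str], clusters: Dict[str, List[str]]) -> List[str]:
    """Replace each representative hit with all members of its cluster, keeping hit order."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for rep in rep_hits:
        # A representative missing from the map is treated as a singleton cluster.
        for member in clusters.get(rep, [rep]):
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


def recall(expanded: Iterable[str], reference: Iterable[str]) -> float:
    """Fraction of the reference hits that also appear in the expanded hit list."""
    reference_set = set(reference)
    if not reference_set:
        return 1.0
    return len(reference_set & set(expanded)) / len(reference_set)


# Hypothetical toy data (made-up identifiers, not real search results).
clusters = {
    "REP_A": ["REP_A", "MEM_A1", "MEM_A2"],
    "REP_B": ["REP_B", "MEM_B1"],
}
uniref50_hits = ["REP_A", "REP_B"]  # short hit list from the clustered database
uniprotkb_hits = ["REP_A", "MEM_A1", "MEM_A2", "REP_B", "MEM_B1", "OTHER_1"]

expanded_hits = expand_hits(uniref50_hits, clusters)
hit_recall = recall(expanded_hits, uniprotkb_hits)
print(expanded_hits)
print(round(hit_recall, 3))
```

In this toy case the two representative hits expand to five sequences, and recall is measured against the six hits of the assumed direct search; the figures reported in the abstract come from the authors' own evaluation, not from this sketch.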
Pages: 926-932
Page count: 7
Related papers
31 in total
  • [1] BASIC LOCAL ALIGNMENT SEARCH TOOL
    ALTSCHUL, SF
    GISH, W
    MILLER, W
    MYERS, EW
    LIPMAN, DJ
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) : 403 - 410
  • [2] Update on activities at the Universal Protein Resource (UniProt) in 2013
    Apweiler, Rolf
    Martin, Maria Jesus
    O'Donovan, Claire
    Magrane, Michele
    Alam-Faruque, Yasmin
    Alpi, Emanuela
    Antunes, Ricardo
    Arganiska, Joanna
    Casanova, Elisabet Barrera
    Bely, Benoit
    Bingley, Mark
    Bonilla, Carlos
    Britto, Ramona
    Bursteinas, Borisas
    Chan, Wei Mun
    Chavali, Gayatri
    Cibrian-Uhalte, Elena
    Da Silva, Alan
    De Giorgi, Maurizio
    Dimmer, Emily
    Fazzini, Francesco
    Gane, Paul
    Fedotov, Alexander
    Castro, Leyla Garcia
    Garmiri, Penelope
    Hatton-Ellis, Emma
    Hieta, Reija
    Huntley, Rachael
    Jacobsen, Julius
    Jones, Rachel
    Legge, Duncan
    Liu, Wudong
    Luo, Jie
    MacDougall, Alistair
    Mutowo, Prudence
    Nightingale, Andrew
    Orchard, Sandra
    Patient, Samuel
    Pichler, Klemens
    Poggioli, Diego
    Pundir, Sangya
    Pureza, Luis
    Qi, Guoying
    Rosanoff, Steven
    Sawford, Tony
    Sehra, Harminder
    Turner, Edward
    Volynkin, Vladimir
    Wardell, Tony
    Watkins, Xavier
    [J]. NUCLEIC ACIDS RESEARCH, 2013, 41 (D1) : D43 - D47
  • [3] Gene Ontology: tool for the unification of biology
    Ashburner, M
    Ball, CA
    Blake, JA
    Botstein, D
    Butler, H
    Cherry, JM
    Davis, AP
    Dolinski, K
    Dwight, SS
    Eppig, JT
    Harris, MA
    Hill, DP
    Issel-Tarver, L
    Kasarskis, A
    Lewis, S
    Matese, JC
    Richardson, JE
    Ringwald, M
    Rubin, GM
    Sherlock, G
    [J]. NATURE GENETICS, 2000, 25 (01) : 25 - 29
  • [4] Bateman A, 2004, NUCLEIC ACIDS RES, V32, pD138, DOI [10.1093/nar/gkp985, 10.1093/nar/gkh121, 10.1093/nar/gkr1065]
  • [5] Clustered sequence representation for fast homology search
    Cameron, Michael
    Bernstein, Yaniv
    Williams, Hugh E.
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2007, 14 (05) : 594 - 614
  • [6] The oligodeoxynucleotide sequences corresponding to never-expressed peptide motifs are mainly located in the non-coding strand
    Capone, Giovanni
    Novello, Giuseppe
    Fasano, Candida
    Trost, Brett
    Bickis, Mik
    Kusalik, Anthony
    Kanduc, Darja
    [J]. BMC BIOINFORMATICS, 2010, 11
  • [7] A new disease-specific machine learning approach for the prediction of cancer-causing missense variants
    Capriotti, Emidio
    Altman, Russ B.
    [J]. GENOMICS, 2011, 98 (04) : 310 - 317
  • [8] Improving the prediction of disease-related variants using protein three-dimensional structure
    Capriotti, Emidio
    Altman, Russ B.
    [J]. BMC BIOINFORMATICS, 2011, 12
  • [9] Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee
    Chang, Jia-Ming
    Di Tommaso, Paolo
    Taly, Jean-Francois
    Notredame, Cedric
    [J]. BMC BIOINFORMATICS, 2012, 13
  • [10] Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation
    Chen, Chuming
    Natale, Darren A.
    Finn, Robert D.
    Huang, Hongzhan
    Zhang, Jian
    Wu, Cathy H.
    Mazumder, Raja
    [J]. PLOS ONE, 2011, 6 (04):