An efficient algorithm for large-scale detection of protein families

Cited by: 2555
Authors
Enright, AJ [1]
Van Dongen, S
Ouzounis, CA
Affiliations
[1] European Bioinformat Inst, EMBL Cambridge Outstn, Computat Gen Grp, Cambridge CB10 1SD, England
[2] Ctr Wiskunde & Informat, NL-1098 SJ Amsterdam, Netherlands
Keywords
DOI
10.1093/nar/30.7.1575
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
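The abstract names the Markov cluster (MCL) algorithm as the core of the method. The following is a minimal sketch of the general MCL iteration (alternating expansion and inflation on a column-stochastic matrix), not the TRIBE-MCL implementation itself. The function names, the default inflation value of 2.0, the added self-loop weight of 1.0, and the toy similarity matrix are all assumptions made for illustration; in practice the weights would come from precomputed sequence similarity scores.

```python
def normalize_columns(m):
    """Rescale each column so it sums to 1 (columns of zeros are left alone)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                out[i][j] = m[i][j] / s
    return out


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric, non-negative similarity matrix.

    Hypothetical helper: the parameter names and defaults are assumptions,
    not values taken from the paper.
    """
    n = len(similarity)
    # Add self-loops (assumed weight 1.0), then make the matrix column-stochastic.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: flow spreads along longer walks
        # Inflation: raise entries to a power and renormalize, which
        # strengthens strong flows and weakens weak ones.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input: nodes 0-2 are mutually similar, nodes 3-4 are mutually similar,
# and there is no similarity between the two groups.
toy = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl(toy))
```

A higher inflation value generally yields finer-grained clusters. This dense, pure-Python version is only suitable for tiny inputs; database-scale clustering requires a sparse implementation.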
Pages: 1575-1584
Page count: 10
Related papers
52 in total
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] [Anonymous], 2000, GRAPH CLUSTERING FLO
  • [3] Apic G, 2001, Bioinformatics, V17 Suppl 1, pS83
  • [4] Domain combinations in archaeal, eubacterial and eukaryotic proteomes
    Apic, G
    Gough, J
    Teichmann, SA
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 2001, 310 (02) : 311 - 325
  • [5] The InterPro database, an integrated documentation resource for protein families, domains and functional sites
    Apweiler, R
    Attwood, TK
    Bairoch, A
    Bateman, A
    Birney, E
    Biswas, M
    Bucher, P
    Cerutti, T
    Corpet, F
    Croning, MDR
    Durbin, R
    Falquet, L
    Fleischmann, W
    Gouzy, J
    Hermjakob, H
    Hulo, N
    Jonassen, I
    Kahn, D
    Kanapin, A
    Karavidopoulou, Y
    Lopez, R
    Marx, B
    Mulder, NJ
    Oinn, TM
    Pagni, M
    Servant, F
    Sigrist, CJA
    Zdobnov, EM
    [J]. NUCLEIC ACIDS RESEARCH, 2001, 29 (01) : 37 - 40
  • [6] Gene Ontology: tool for the unification of biology
    Ashburner, M
    Ball, CA
    Blake, JA
    Botstein, D
    Butler, H
    Cherry, JM
    Davis, AP
    Dolinski, K
    Dwight, SS
    Eppig, JT
    Harris, MA
    Hill, DP
    Issel-Tarver, L
    Kasarskis, A
    Lewis, S
    Matese, JC
    Richardson, JE
    Ringwald, M
    Rubin, GM
    Sherlock, G
    [J]. NATURE GENETICS, 2000, 25 (01) : 25 - 29
  • [7] PRINTS prepares for the new millennium
    Attwood, TK
    Flower, DR
    Lewis, AP
    Mabey, JE
    Morgan, SR
    Scordis, P
    Selley, JN
    Wright, W
    [J]. NUCLEIC ACIDS RESEARCH, 1999, 27 (01) : 220 - 225
  • [8] The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
    Bairoch, A
    Apweiler, R
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 45 - 48
  • [9] Bateman A, 2004, NUCLEIC ACIDS RES, V32, pD138, DOI [10.1093/nar/gkp985, 10.1093/nar/gkh121, 10.1093/nar/gkr1065]
  • [10] Genomes OnLine Database (GOLD): a monitor of genome projects world-wide
    Bernal, A
    Ear, U
    Kyrpides, N
    [J]. NUCLEIC ACIDS RESEARCH, 2001, 29 (01) : 126 - 127