Protein molecular function prediction by Bayesian phylogenomics

Cited by: 113
Authors
Engelhardt, BE [1]
Jordan, MI
Muratore, KE
Brenner, SE
Affiliations
[1] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA USA
[3] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley, CA USA
[4] Univ Calif Berkeley, Dept Plant & Microbial Biol, Berkeley, CA USA
DOI
10.1371/journal.pcbi.0010045
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
Pages: 432-445
Page count: 14