The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction

Cited by: 24
Authors
Huttenhower, Curtis [1, 2]
Hibbs, Matthew A. [3 ]
Myers, Chad L. [4 ]
Caudy, Amy A. [2 ]
Hess, David C. [2 ]
Troyanskaya, Olga G. [1, 2]
Affiliations
[1] Princeton Univ, Dept Comp Sci, Princeton, NJ 08540 USA
[2] Princeton Univ, Carl Icahn Lab, Lewis Sigler Inst Integrat Genom, Princeton, NJ 08544 USA
[3] Jackson Lab, Bar Harbor, ME 04609 USA
[4] Univ Minnesota, Dept Comp Sci, Minneapolis, MN 55455 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
GENE-EXPRESSION; INTEGRATION; ANNOTATION; NETWORKS; DATABASE; FRAMEWORK; ONTOLOGY; TOOLS;
DOI
10.1093/bioinformatics/btp397
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best-studied model organisms, the biological reliability of such evaluations may be called into question.
Results: We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches, even those employing the same training data, is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hope of mitigating these effects in future comparative evaluations.
Pages: 2404-2410
Page count: 7