Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes

Cited by: 287
Authors
Chen, Feng [2, 3]
Mackey, Aaron J. [1, 3]
Vermunt, Jeroen K. [4]
Roos, David S. [1, 3]
Affiliations
[1] Univ Penn, Dept Biol, Philadelphia, PA 19104 USA
[2] Univ Penn, Dept Chem, Philadelphia, PA 19104 USA
[3] Univ Penn, Genom Inst, Philadelphia, PA 19104 USA
[4] Tilburg Univ, Dept Methodol & Stat, Tilburg, Netherlands
Source
PLOS ONE | 2007 / Volume 2 / Issue 04
DOI
10.1371/journal.pone.0000383
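Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09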
Abstract
Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies in comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of their performance, owing to the lack of a genomic-scale 'gold standard' orthology dataset. Even in the absence of such datasets, comparing results from alternative methodologies yields useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity > 80%: INPARANOID identifies orthologs across two species, while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than either the (manually curated) KOG database or the homolog clustering algorithm TribeMCL. Using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and to evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insights and guidance for method selection, tuning and development for different applications.
Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology.
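To illustrate the idea behind LCA described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes exactly two latent classes (ortholog / not ortholog) and conditional independence between methods given the true class, whereas the paper also examines statistical dependence between strategies. The function names `simulate_votes` and `fit_lca` and all numeric values are hypothetical, and the data are simulated rather than taken from the study.

```python
import random
from collections import Counter
from math import prod


def simulate_votes(n, prevalence, sens, spec, seed=0):
    """Simulate yes/no calls from several methods on n candidate pairs."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n):
        is_ortholog = rng.random() < prevalence
        # P(method says yes) is sens for true orthologs, 1 - spec otherwise.
        p_yes = sens if is_ortholog else [1 - c for c in spec]
        votes.append(tuple(int(rng.random() < p) for p in p_yes))
    return votes


def _clamp(p, eps=1e-6):
    """Keep probabilities away from exactly 0 or 1."""
    return min(max(p, eps), 1 - eps)


def fit_lca(votes, n_iter=300):
    """Two-class latent class model fitted by EM, with no gold standard.

    Returns (prevalence, sensitivities, specificities).
    """
    patterns = Counter(votes)  # each distinct vote pattern with its count
    n, k = len(votes), len(votes[0])
    # Starting above 0.5 ties the "positive" class to "yes" calls.
    prevalence, sens, spec = 0.5, [0.7] * k, [0.7] * k
    for _ in range(n_iter):
        pos = neg = 0.0
        yes_pos = [0.0] * k  # expected yes calls among true orthologs
        no_neg = [0.0] * k   # expected no calls among non-orthologs
        for v, count in patterns.items():
            # E-step: posterior probability that this pattern is a true ortholog.
            like_pos = prevalence * prod(
                s if x else 1 - s for x, s in zip(v, sens))
            like_neg = (1 - prevalence) * prod(
                1 - c if x else c for x, c in zip(v, spec))
            w = like_pos / (like_pos + like_neg)
            pos += count * w
            neg += count * (1 - w)
            for j, x in enumerate(v):
                if x:
                    yes_pos[j] += count * w
                else:
                    no_neg[j] += count * (1 - w)
        # M-step: re-estimate parameters from the expected counts.
        prevalence = _clamp(pos / n)
        sens = [_clamp(y / pos) for y in yes_pos]
        spec = [_clamp(q / neg) for q in no_neg]
    return prevalence, sens, spec


if __name__ == "__main__":
    # Hypothetical settings for four methods; not values from the paper.
    data = simulate_votes(5000, 0.4, [0.9, 0.8, 0.7, 0.85],
                          [0.7, 0.9, 0.95, 0.8], seed=1)
    est_prev, est_sens, est_spec = fit_lca(data)
    print(round(est_prev, 3))
    print([round(s, 3) for s in est_sens])
    print([round(c, 3) for c in est_spec])
```

With at least three conditionally independent methods the two-class model is identifiable, which is why the sketch uses four. Grouping identical vote patterns keeps each EM iteration cheap regardless of how many candidate pairs are scored.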
Pages: 12