Reconciliation with Non-Binary Species Trees

被引:107
作者
Vernot, Benjamin [1 ]
Stolzer, Maureen [1 ]
Goldman, Aiton [1 ]
Durand, Dannie [1 ,2 ]
机构
[1] Carnegie Mellon Univ, Dept Biol Sci, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
关键词
deep coalescence; gene duplication; gene loss; lineage sorting; non-binary species trees; polytomy; reconciliation;
D O I
10.1089/cmb.2008.0092
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Reconciliation extracts information from the topological incongruence between gene and species trees to infer duplications and losses in the history of a gene family. The inferred duplication-loss histories provide valuable information for a broad range of biological applications, including ortholog identification, estimating gene duplication times, and rooting and correcting gene trees. While reconciliation for binary trees is a tractable and well studied problem, there are no algorithms for reconciliation with non-binary species trees. Yet a striking proportion of species trees are non-binary. For example, 64% of branch points in the NCBI taxonomy have three or more children. When applied to non-binary species trees, current algorithms overestimate the number of duplications because they cannot distinguish between duplication and incomplete lineage sorting. We present the first algorithms for reconciling binary gene trees with non-binary species trees under a duplication-loss parsimony model. Our algorithms utilize an efficient mapping from gene to species trees to infer the minimum number of duplications in O(\V-G\ . (k(S) + h(S))) time, where \V-G\ is the number of nodes in the gene tree, h(S) is the height of the species tree and k(S) is the size of its largest polytomy. We present a dynamic programming algorithm which also minimizes the total number of losses. Although this algorithm is exponential in the size of the largest polytomy, it performs well in practice for polytomies with outdegree of 12 or less. We also present a heuristic which estimates the minimal number of losses in polynomial time. In empirical tests, this algorithm finds an optimal loss history 99% of the time. Our algorithms have been implemented in NOTUNG, a robust, production quality, tree-fitting program, which provides a graphical user interface for exploratory analysis and also supports automated, high-throughput analysis of large data sets.
引用
收藏
页码:981 / 1006
页数:26
相关论文
共 68 条
[1]  
[Anonymous], 2001, Proceedings of the 5th Annual International Conference on Research in Computational Molecular Biology, DOI [DOI 10.1145/369133.369188, 10.1145/369133.369188]
[2]  
ARVESTAD L, 2004, GENE TREE RECONSTRUC, P326
[3]   Bayesian gene/species tree reconciliation and orthology analysis using MCMC [J].
Arvestad, Lars ;
Berglund, Ann-Charlotte ;
Lagergren, Jens ;
Sennblad, Bengt .
BIOINFORMATICS, 2003, 19 :i7-i15
[4]  
Bansal MS, 2007, LECT NOTES COMPUT SC, V4453, P238
[5]   The LCA problem revisited [J].
Bender, MA ;
Farach-Colton, M .
LATIN 2000: THEORETICAL INFORMATICS, 2000, 1776 :88-94
[6]   Optimal gene trees from sequences and species trees using a soft interpretation of parsimony [J].
Berglund-Sonnhammer, Ann-Charlotte ;
Steffansson, Par ;
Betts, Matthew J. ;
Liberles, David A. .
JOURNAL OF MOLECULAR EVOLUTION, 2006, 63 (02) :240-250
[7]   The gain and loss of genes during 600 million years of vertebrate evolution [J].
Blomme, Tine ;
Vandepoele, Klaas ;
De Bodt, Stefanie ;
Simillion, Cedric ;
Maere, Steven ;
Van de Peer, Yves .
GENOME BIOLOGY, 2006, 7 (05)
[8]   The serine repeat antigen (SERA) gene family phylogeny in Plasmodium:: The impact of GC content and reconciliation of gene and species trees [J].
Bourgon, R ;
Delorenzi, M ;
Sargeant, T ;
Hodder, AN ;
Crabb, BS ;
Speed, TP .
MOLECULAR BIOLOGY AND EVOLUTION, 2004, 21 (11) :2161-2171
[9]  
Chang WC, 2006, LECT NOTES COMPUT SC, V4112, P235
[10]  
CHAUVE C, 2007, INFERRING DUPLICATIO, P45