Likelihood analysis of phylogenetic networks using directed graphical models

Cited by: 57
Authors
Strimmer, K
Moulton, V [1]
Affiliations
[1] Mid Sweden Univ, Dept Math & Phys, FMI, S-85170 Sundsvall, Sweden
[2] Max Planck Inst Biochem, MIPS, GSF Forschungszentrum Umwelt & Gesundheit, D-82152 Martinsried, Germany
Keywords
maximum likelihood; phylogenetic network; graphical model; Bayesian network; evolutionary tree; Markov chain Monte Carlo sampling
DOI
10.1093/oxfordjournals.molbev.a026367
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
A method for computing the likelihood of a set of sequences assuming a phylogenetic network as an evolutionary hypothesis is presented. The approach applies directed graphical models to sequence evolution on networks and is a natural generalization of earlier work by Felsenstein on evolutionary trees, including it as a special case. The likelihood computation involves several steps. First, the phylogenetic network is rooted to form a directed acyclic graph (DAG). Then, applying standard models for nucleotide/amino acid substitution, the DAG is converted into a Bayesian network from which the joint probability distribution involving all nodes of the network can be directly read. The joint probability is explicitly dependent on branch lengths and on recombination parameters (prior probability of a parent sequence). The likelihood of the data assuming no knowledge of hidden nodes is obtained by marginalization, i.e., by summing over all combinations of unknown states. As the number of terms increases exponentially with the number of hidden nodes, a Markov chain Monte Carlo procedure (Gibbs sampling) is used to accurately approximate the likelihood by summing over the most important states only. Investigating a human T-cell lymphotropic virus (HTLV) data set and optimizing both branch lengths and recombination parameters, we find that the likelihood of a corresponding phylogenetic network outperforms a set of competing evolutionary trees. In general, except for the case of a tree, the likelihood of a network will be dependent on the choice of the root, even if a reversible model of substitution is applied. Thus, the method also provides a way in which to root a phylogenetic network by choosing a node that produces a most likely network.
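The marginalization step described in the abstract can be illustrated with a minimal sketch for a single alignment column. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy DAG (a root R, two hidden nodes H1 and H2, leaves a and c, and a recombinant leaf b with both H1 and H2 as parents), the Jukes-Cantor substitution model, equal branch lengths, and the mixture form of the recombination node, in which b copies from H1 with prior probability w and from H2 with probability 1 - w. The names `jc69` and `site_likelihood` are hypothetical.

```python
import itertools
import math

STATES = "ACGT"

def jc69(t):
    """Jukes-Cantor transition probabilities P[x][y] for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {x: {y: (same if x == y else diff) for y in STATES} for x in STATES}

def site_likelihood(observed, w, t=0.1):
    """Brute-force marginal likelihood of one column on the toy DAG.

    observed: dict with nucleotides for the leaves "a", "b", "c".
    w: assumed prior probability that leaf b descends from H1 (else H2).
    t: branch length shared by all edges (an illustrative simplification).
    """
    P = jc69(t)
    total = 0.0
    # Sum the joint probability over every assignment of the hidden nodes.
    for r, h1, h2 in itertools.product(STATES, repeat=3):
        p = 0.25                                  # uniform prior at the root
        p *= P[r][h1] * P[r][h2]                  # R -> H1, R -> H2
        p *= P[h1][observed["a"]]                 # H1 -> a
        p *= P[h2][observed["c"]]                 # H2 -> c
        # Recombination node: mixture over the two possible parents.
        p *= w * P[h1][observed["b"]] + (1.0 - w) * P[h2][observed["b"]]
        total += p
    return total

if __name__ == "__main__":
    column = {"a": "A", "b": "A", "c": "G"}
    for w in (0.0, 0.5, 1.0):
        print(w, site_likelihood(column, w))
```

Setting w to 0 or 1 in this sketch reduces the network to a tree, the special case the abstract attributes to Felsenstein. The loop visits 4^k assignments for k hidden nodes, which is the exponential growth that motivates the Gibbs sampling approximation mentioned above.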
Pages: 875-881
Page count: 7