Rapid Likelihood Analysis on Large Phylogenies Using Partial Sampling of Substitution Histories

Cited by: 18
Authors
de Koning, A. P. Jason
Gu, Wanjun
Pollock, David D. [1]
Affiliations
[1] Univ Colorado Denver, Sch Med, Dept Biochem & Mol Genet, Denver, CO 80202 USA
Funding
National Institutes of Health (USA)
Keywords
likelihood analysis; time complexity; substitution histories; MCMC; data augmentation; PROTEIN EVOLUTION; DNA-SEQUENCES; UNIFORMIZATION; INFERENCE; ALIGNMENT; PATTERNS; MRBAYES; TREES; RATES;
DOI
10.1093/molbev/msp228
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Likelihood-based approaches can reconstruct evolutionary processes in greater detail and with better precision from larger data sets. The extremely large comparative genomic data sets that are now being generated thus create new opportunities for understanding molecular evolution, but analysis of such large quantities of data poses escalating computational challenges. Recently developed Markov chain Monte Carlo methods that augment substitution histories are a promising approach to alleviate these computational costs. We analyzed the computational costs of several such approaches, considering how they scale with model and data set complexity. This provided a theoretical framework to understand the most important computational bottlenecks, leading us to combine novel variations of our conditional pathway integration approach with recent advances made by others. The resulting technique ("partial sampling" of substitution histories) is considerably faster than all other approaches we considered. It is accurate, simple to implement, and scales exceptionally well with dimensions of model complexity and data set size. In particular, the time complexity of sampling unobserved substitution histories using the new method is much lower than that of previously existing methods, and model parameter and branch length updates are independent of data set size. We compared the performance of methods on a 224-taxon set of mammalian cytochrome-b sequences. For a simple nucleotide substitution model, partial sampling was at least 10 times faster than the PhyloBayes program, which samples substitutions in continuous time, and about 100 times faster than when using fully integrated substitution histories. Under a general reversible model of amino acid substitution, the partial sampling method was 1,600 times faster than when using fully integrated substitution histories, confirming significantly improved scaling with model state-space complexity. Partial sampling of substitutions thus dramatically improves the utility of likelihood approaches for analyzing complex evolutionary processes on large data sets.
Pages: 249-265
Number of pages: 17