Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees

被引:39
作者
Izquierdo-Carrasco, Fernando [1 ]
Smith, Stephen A. [1 ,2 ]
Stamatakis, Alexandros [1 ]
机构
[1] Heidelberg Inst Theoret Studies, Sci Comp Grp, Exelixis Lab, D-69118 Heidelberg, Germany
[2] Univ Michigan, Dept Ecol & Evolutionary Biol, Smith Lab 2, Ann Arbor, MI 48109 USA
来源
BMC BIOINFORMATICS | 2011年 / 12卷
基金
美国国家科学基金会;
关键词
MULTIPLE SEQUENCE ALIGNMENT; DNA-SEQUENCES; RAXML; RATES; MODEL;
D O I
10.1186/1471-2105-12-470
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood. Results: We introduce methods for solving these three key problems and provide respective proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the G model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems. Conclusions: We address three major issues pertaining to large scale tree reconstruction under maximum likelihood and propose respective solutions. Respective proof-of-concept/production-level implementations of our ideas are made available as open-source code.
引用
收藏
页数:14
相关论文
共 32 条
[1]  
[Anonymous], 2006, GENETIC ALGORITHM AP
[2]   Open source clustering software [J].
de Hoon, MJL ;
Imoto, S ;
Nolan, J ;
Miyano, S .
BIOINFORMATICS, 2004, 20 (09) :1453-1454
[3]   BEAST: Bayesian evolutionary analysis by sampling trees [J].
Drummond, Alexei J. ;
Rambaut, Andrew .
BMC EVOLUTIONARY BIOLOGY, 2007, 7 (1)
[4]   MUSCLE: multiple sequence alignment with high accuracy and high throughput [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (05) :1792-1797
[5]   EVOLUTIONARY TREES FROM DNA-SEQUENCES - A MAXIMUM-LIKELIHOOD APPROACH [J].
FELSENSTEIN, J .
JOURNAL OF MOLECULAR EVOLUTION, 1981, 17 (06) :368-376
[6]   INDELible: A Flexible Simulator of Biological Sequence Evolution [J].
Fletcher, William ;
Yang, Ziheng .
MOLECULAR BIOLOGY AND EVOLUTION, 2009, 26 (08) :1879-1888
[7]   Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups [J].
Goloboff, Pablo A. ;
Catalano, Santiago A. ;
Mirande, J. Marcos ;
Szumik, Claudia A. ;
Arias, J. Salvador ;
Kallersjo, Mari ;
Farris, James S. .
CLADISTICS, 2009, 25 (03) :211-230
[8]   New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0 [J].
Guindon, Stephane ;
Dufayard, Jean-Francois ;
Lefort, Vincent ;
Anisimova, Maria ;
Hordijk, Wim ;
Gascuel, Olivier .
SYSTEMATIC BIOLOGY, 2010, 59 (03) :307-321
[9]   Recent developments in the MAFFT multiple sequence alignment program [J].
Katoh, Kazutaka ;
Toh, Hiroyuki .
BRIEFINGS IN BIOINFORMATICS, 2008, 9 (04) :286-298
[10]   A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process [J].
Lartillot, N ;
Philippe, H .
MOLECULAR BIOLOGY AND EVOLUTION, 2004, 21 (06) :1095-1109