SDM: A fast distance-based approach for (super) tree building in phylogenomics

Cited by: 54
Authors
Criscuolo, Alexis
Berry, Vincent
Douzery, Emmanuel J. P.
Gascuel, Olivier
Affiliations
[1] Univ Montpellier 2, ISEM, Grp Phylogenie, F-34095 Montpellier 05, France
[2] Univ Montpellier 2, CNRS, LIRMM, Equipe Methodes & Algorithmes Bioinformat, F-34392 Montpellier 05, France
Keywords
DOI
10.1080/10635150600969872
CLC classification number
Q [Biological Sciences];
Subject classification code
07 ; 0710 ; 09 ;
Abstract
Phylogenomic studies aim to build phylogenies from large sets of homologous genes. Such "genome-sized" data require fast methods, because of the typically large numbers of taxa examined. In this framework, distance-based methods are useful for exploratory studies and building a starting tree to be refined by a more powerful maximum likelihood (ML) approach. However, estimating evolutionary distances directly from concatenated genes gives poor topological signal as genes evolve at different rates. We propose a novel method, named super distance matrix (SDM), which follows the same line as average consensus supertree (ACS; Lapointe and Cucumel, 1997) and combines the evolutionary distances obtained from each gene into a single distance supermatrix to be analyzed using a standard distance-based algorithm. SDM deforms the source matrices, without modifying their topological message, to bring them as close as possible to each other; these deformed matrices are then averaged to obtain the distance supermatrix. We show that this problem is equivalent to the minimization of a least-squares criterion subject to linear constraints. This problem has a unique solution which is obtained by solving a linear system. As this system is sparse, solving it in practice requires O(n^a k^a) time, where n is the number of taxa, k the number of matrices, and a < 2, which allows the distance supermatrix to be quickly obtained. Several uses of SDM are proposed, from fast exploratory studies to more accurate approaches requiring heavier computing time. Using simulations, we show that SDM is a relevant alternative to the standard matrix representation with parsimony (MRP) method, notably when the taxa sets of the different genes have low overlap. We also show that SDM can be used to build an excellent starting tree for an ML approach, which both reduces the computing time and increases the topological accuracy. We use SDM to analyze the data set of Gatesy et al. (2002, Syst. Biol. 51: 652-664) that involves 48 genes of 75 placental mammals. The results indicate that these genes have strong rate heterogeneity and confirm the simulation conclusions.
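The "least-squares criterion subject to linear constraints" idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the published SDM algorithm. It assumes that every source matrix covers the same taxa with no missing entries, and that each matrix is deformed only by one positive scale factor (called `alphas` here), with the factors constrained to sum to the number of matrices. The function names `solve_linear` and `scale_and_average` are hypothetical, and the dense Gaussian elimination stands in for whatever sparse solver a real implementation would use.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def scale_and_average(matrices):
    """Simplified illustration (assumption: scale factors only, complete matrices).

    Minimizes sum over matrix pairs p < q and taxon pairs i < j of
    (alpha_p * d_p[i][j] - alpha_q * d_q[i][j]) ** 2
    subject to sum(alpha) == k, then averages the scaled matrices.
    """
    k = len(matrices)
    n = len(matrices[0])
    # Flatten the upper triangle of each distance matrix into a vector.
    vecs = [[d[i][j] for i in range(n) for j in range(i + 1, n)] for d in matrices]

    def dot(x, y):
        return sum(u * v for u, v in zip(x, y))

    # KKT system: stationarity of the Lagrangian plus the sum constraint.
    size = k + 1
    kkt = [[0.0] * size for _ in range(size)]
    for p in range(k):
        for q in range(k):
            g = dot(vecs[p], vecs[q])
            kkt[p][q] = 2 * (k - 1) * g if p == q else -2 * g
        kkt[p][k] = -1.0  # Lagrange multiplier column
        kkt[k][p] = 1.0   # constraint row: sum of alphas
    rhs = [0.0] * k + [float(k)]
    alphas = solve_linear(kkt, rhs)[:k]

    supermatrix = [
        [sum(alphas[p] * matrices[p][i][j] for p in range(k)) / k for j in range(n)]
        for i in range(n)
    ]
    return alphas, supermatrix


# Toy usage: three "genes" whose distances differ only by evolutionary rate.
base = [[0, 2, 3], [2, 0, 4], [3, 4, 0]]
fast = [[2 * x for x in row] for row in base]
slow = [[0.5 * x for x in row] for row in base]
alphas, supermatrix = scale_and_average([base, fast, slow])
```

In this toy case the fitted factors shrink the fast gene and stretch the slow one so that all three scaled matrices coincide before averaging, which is the rate-heterogeneity correction the abstract motivates.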
Pages: 740-755
Page count: 16
Related papers
76 in total
[1]  
ANISIMOVA M, 2006, IN PRESS SYST BIOL
[2]  
BARTHELEMY JP, 1991, WILEY INTERSCIENCE S
[4]   On the interpretation of bootstrap trees: Appropriate threshold of clade selection and induced gain [J].
Berry, V ;
Gascuel, O .
MOLECULAR BIOLOGY AND EVOLUTION, 1996, 13 (07) :999-1011
[5]   Calculating the evolutionary rates of different genes: A fast, accurate estimator with applications to maximum likelihood phylogenetic analysis [J].
Bevan, RB ;
Lang, BF ;
Bryant, D .
SYSTEMATIC BIOLOGY, 2005, 54 (06) :900-915
[6]  
Bininda-Emonds O., 2004, PHYLOGENETIC SUPERTR
[7]   Assessment of the accuracy of matrix representation with parsimony analysis supertree construction [J].
Bininda-Emonds, ORP ;
Sanderson, MJ .
SYSTEMATIC BIOLOGY, 2001, 50 (04) :565-579
[8]  
Bininda-Emonds ORP, 1998, SYST BIOL, V47, P497
[9]  
Bourque M, 1978, Ph.D. Dissertation
[10]  
BRYANT D, 2001, DIMACS AMS, P163