Phylogenetic estimation of context-dependent substitution rates by maximum likelihood

Cited by: 259
Authors
Siepel, A [1]
Haussler, D
Affiliations
[1] Univ Calif Santa Cruz, Ctr Biomol Sci & Engn, Santa Cruz, CA 95064 USA
[2] Univ Calif Santa Cruz, Howard Hughes Med Inst, Santa Cruz, CA 95064 USA
Keywords
neighbor-dependent substitution; CpG effect; codon model; expectation maximization; substitution rate matrix;
DOI
10.1093/molbev/msh039
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification code
071010 ; 081704 ;
Abstract
Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the N - 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects.
Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.
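For orientation, the abstract's mention of Felsenstein's algorithm can be illustrated with a minimal sketch of the standard, site-independent pruning recursion for a single alignment column. This is not the extension described in the article: the Jukes-Cantor model, the uniform root distribution, the nested-list tree encoding, and all function names below are assumptions chosen only to keep the example short.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor probability that base i is replaced by base j
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node):
    """Return [L(A), L(C), L(G), L(T)] for the subtree rooted at node.

    Hypothetical tree encoding: a leaf is a one-character string holding
    the observed base; an internal node is a list of (child, branch_length)
    pairs.
    """
    if isinstance(node, str):
        # Leaf: likelihood 1 for the observed base, 0 for the others.
        return [1.0 if b == node else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial_likelihoods(child)
        for i, a in enumerate(BASES):
            # Sum over the unobserved base at the lower end of the branch.
            result[i] *= sum(
                jc69_prob(a, b, t) * child_l[j] for j, b in enumerate(BASES)
            )
    return result

def site_likelihood(tree):
    """Likelihood of one alignment column, assuming a uniform root distribution."""
    return sum(0.25 * l for l in partial_likelihoods(tree))

# Example: a three-leaf tree with one internal node.
example_tree = [([("A", 0.1), ("A", 0.2)], 0.05), ("C", 0.3)]
example_value = site_likelihood(example_tree)
```

The context-dependent models in the article replace single bases with N-tuples of adjacent bases and combine the resulting quantities through conditional probabilities across sites; the sketch above covers only the single-site building block.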
Pages: 468-488
Page count: 21