Likelihood-based clustering (LiBaC) for codon models, a method for grouping sites according to similarities in the underlying process of evolution

Cited by: 10
Authors
Bao, Le [1]
Gu, Hong [1]
Dunn, Katherine A. [2]
Bielawski, Joseph P. [1, 2]
Affiliations
[1] Dalhousie Univ, Dept Math & Stat, Halifax, NS, Canada
[2] Dalhousie Univ, Dept Biol, Halifax, NS, Canada
Keywords
codon model; likelihood-based clustering; Bayes error rate; nonsynonymous/synonymous rate ratio; positive Darwinian selection
DOI
10.1093/molbev/msn145
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
Models of codon evolution are useful for investigating the strength and direction of natural selection via a parameter for the nonsynonymous/synonymous rate ratio (ω = dN/dS). Different codon models are available to account for the diversity of evolutionary patterns among sites. Codon models that specify data partitions as fixed effects allow the most evolutionary diversity among sites but require that site partitions be identifiable a priori. Models that use a parametric distribution to express the variability in the ω ratio across sites do not require a priori partitioning of sites, but they permit less among-site diversity in the evolutionary process. Simulation studies presented in this paper indicate that differences among sites in estimates of ω under an overly simplistic analytical model can reflect more than just natural selection pressure. We also find that the classic likelihood ratio tests for positive selection have a high false-positive rate in some situations. In this paper, we develop a new method for assigning codon sites to groups, where each group has a different model and the likelihood over all sites is maximized. The method, called likelihood-based clustering (LiBaC), can be viewed as a generalization of the family of model-based clustering approaches to models of codon evolution. We report the performance of several LiBaC-based methods, and of selected alternative methods, over a wide variety of scenarios. We find that LiBaC, under an appropriate model, can provide reliable parameter estimates when the process of evolution is very heterogeneous among groups of sites. Certain types of proteins, such as transmembrane proteins, are expected to exhibit such heterogeneity. A survey of genes encoding transmembrane proteins suggests that overly simplistic models could be producing a false signal for positive selection among such genes. In these cases, LiBaC-based methods offer an important addition to a "toolbox" of methods, thereby helping to uncover robust evidence for the action of positive selection.
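The idea of assigning sites to groups so that a total likelihood is maximized can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a hard-assignment scheme in the style of classification EM, and it replaces the codon substitution model on a phylogeny with a simple binomial model. All names here (`binom_loglik`, `cluster_sites`, the per-site counts) are hypothetical and chosen only for illustration.

```python
import math

def binom_loglik(n_nonsyn, n_total, p):
    """Binomial log-likelihood of a site (constant term dropped).

    Toy stand-in for a per-site codon-model likelihood: p is the
    probability that a change at the site is nonsynonymous.
    """
    p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite
    return n_nonsyn * math.log(p) + (n_total - n_nonsyn) * math.log(1 - p)

def cluster_sites(sites, init_p, max_iter=100):
    """Alternate hard assignment and per-group re-estimation.

    sites:  list of (n_nonsyn, n_total) counts, one pair per site
    init_p: initial parameter for each group (max_iter must be >= 1)
    """
    params = list(init_p)
    labels = None
    for _ in range(max_iter):
        # Classification step: each site joins the group whose
        # current model gives it the highest log-likelihood.
        new_labels = [
            max(range(len(params)), key=lambda g: binom_loglik(k, n, params[g]))
            for k, n in sites
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Maximization step: re-fit each group's parameter from its sites.
        for g in range(len(params)):
            k_sum = sum(k for (k, n), z in zip(sites, labels) if z == g)
            n_sum = sum(n for (k, n), z in zip(sites, labels) if z == g)
            if n_sum > 0:
                params[g] = k_sum / n_sum
    # Total log-likelihood over all sites under the final grouping.
    total = sum(binom_loglik(k, n, params[z]) for (k, n), z in zip(sites, labels))
    return labels, params, total

# Made-up counts: three conserved-looking sites, three fast-changing ones.
sites = [(1, 10), (0, 10), (2, 10), (9, 10), (8, 10), (10, 10)]
labels, params, total = cluster_sites(sites, init_p=[0.2, 0.7])
print(labels, params, total)
```

In this sketch the number of groups is fixed in advance by the length of `init_p`, and the result can depend on the starting values, so in practice one would likely try several initializations.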
Pages: 1995-2007
Number of pages: 13