An improved general amino acid replacement matrix

Cited by: 2298
Authors
Le, Si Quang [1]
Gascuel, Olivier [1]
Affiliations
[1] Univ Montpellier 2, CNRS, LIRMM, Montpellier, France
Keywords
amino acid substitutions; replacement matrices; JTT; WAG; maximum likelihood estimations; phylogenetic inference;
DOI
10.1093/molbev/msn067
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) and their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and using a much larger and more diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the LG performance, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42, when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse with 2 alignments only; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
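The abstract reports model comparisons as an Akaike information criterion (AIC) gain per site. As a minimal sketch of how such a figure could be computed from two log-likelihoods, assuming the standard definition AIC = 2k - 2 ln L: the function names, the default parameter counts, and the numbers below are hypothetical illustrations, not values or code from the paper.

```python
def aic(log_likelihood: float, n_free_params: int) -> float:
    """Akaike information criterion, AIC = 2k - 2 ln L (lower is better)."""
    return 2.0 * n_free_params - 2.0 * log_likelihood


def aic_gain_per_site(lnl_new: float, lnl_ref: float, n_sites: int,
                      k_new: int = 0, k_ref: int = 0) -> float:
    """AIC improvement of a new model over a reference model, per alignment site.

    Positive values favour the new model. With equal parameter counts
    (the default here, an assumption), the penalty terms cancel and the
    gain reduces to 2 * (lnl_new - lnl_ref) / n_sites.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites


# Hypothetical log-likelihoods for one 300-site alignment (made-up numbers).
gain = aic_gain_per_site(lnl_new=-10440.0, lnl_ref=-10500.0, n_sites=300)
print(f"AIC gain per site: {gain:.2f}")
```

When two fixed matrices are compared on the same tree model, the number of free parameters is the same for both fits, which is why the sketch defaults both counts to the same value; pass `k_new` and `k_ref` explicitly if the models differ.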
Pages: 1307-1320
Page count: 14
Related papers
49 in total
[1]  
Abascal F, 2007, MOL BIOL EVOL, V24, P1
[2]   Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA [J].
Adachi, J ;
Waddell, PJ ;
Martin, W ;
Hasegawa, M .
JOURNAL OF MOLECULAR EVOLUTION, 2000, 50 (04) :348-358
[3]  
Adachi J, 1996, J MOL EVOL, V42, P459
[4]   NEW LOOK AT STATISTICAL-MODEL IDENTIFICATION [J].
AKAIKE, H .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1974, AC19 (06) :716-723
[5]  
[Anonymous], 1978, Atlas of protein sequence and structure
[6]  
[Anonymous], 2004, Inferring Phylogenies
[7]   Estimation of reversible substitution matrices from multiple pairs of sequences [J].
Arvestad, L ;
Bruno, WJ .
JOURNAL OF MOLECULAR EVOLUTION, 1997, 45 (06) :696-703
[8]   Efficient methods for estimating amino acid replacement rates [J].
Arvestad, Lars .
JOURNAL OF MOLECULAR EVOLUTION, 2006, 62 (06) :663-673
[9]  
Bateman A, 2002, NUCLEIC ACIDS RES, V30, P276, DOI [10.1093/nar/gkr1065, 10.1093/nar/gkp985, 10.1093/nar/gkh121]
[10]   AMINO-ACID SUBSTITUTION DURING FUNCTIONALLY CONSTRAINED DIVERGENT EVOLUTION OF PROTEIN SEQUENCES [J].
BENNER, SA ;
COHEN, MA ;
GONNET, GH .
PROTEIN ENGINEERING, 1994, 7 (11) :1323-1332