Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified

Cited by: 933
Authors
Keane, TM [1]
Creevey, CJ
Pentony, MM
Naughton, TJ
McInerney, JO
Affiliations
[1] Natl Univ Ireland, Dept Biol, Bioinformat Lab, Maynooth, Kildare, Ireland
[2] EMBL Heidelberg, Bork Grp, Heidelberg, Germany
[3] UCL, Dept Comp Sci, London, England
[4] Natl Univ Ireland, Dept Comp Sci, Maynooth, Kildare, Ireland
Keywords
DOI
10.1186/1471-2148-6-29
Chinese Library Classification
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Background: In recent years, model-based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner.
Results: We start by highlighting the potential dangers of arbitrarily choosing protein models by demonstrating an empirical example where a single alignment can produce two topologically different and strongly supported phylogenies using two different arbitrarily chosen amino acid substitution models. We demonstrate that in simple simulations, statistical methods of model selection are indeed robust and likely to be useful for protein model selection. We have investigated patterns of amino acid substitution among homologous sequences from the three domains of life and our results show that no single amino acid matrix is optimal for any of the datasets. Perhaps most interestingly, we demonstrate that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins.
Conclusion: This demonstrates that choosing protein models based on their source or method of construction may not be appropriate.
Pages: 17