Evolutionary model selection with a genetic algorithm: A case study using stem RNA

Cited by: 16

Authors:

Kosakovsky Pond, Sergei L. [1]

Mannino, Frank V.

Gravenor, Michael B.

Muse, Spencer V.

Frost, Simon D. W.

Affiliations:

[1] Univ Calif San Diego, Dept Pathol, La Jolla, CA 92093 USA

[2] N Carolina State Univ, Bioinformat Res Ctr, Raleigh, NC 27695 USA

[3] Univ Coll Swansea, Sch Med, Swansea, W Glam, Wales

Source:

MOLECULAR BIOLOGY AND EVOLUTION | 2007 / Vol. 24 / Issue 01

Keywords:

RNA sequence evolution; secondary structure; model selection; genetic algorithms; multimodel inference;

DOI:

10.1093/molbev/msl144

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

The choice of a probabilistic model to describe sequence evolution can and should be justified. Underfitting the data through the use of overly simplistic models may miss out on interesting phenomena and lead to incorrect inferences. Overfitting the data with models that are too complex may ascribe biological meaning to statistical artifacts and result in falsely significant findings. We describe a likelihood-based approach for evolutionary model selection. The procedure employs a genetic algorithm (GA) to quickly explore a combinatorially large set of all possible time-reversible Markov models with a fixed number of substitution rates. When applied to stem RNA data subject to well-understood evolutionary forces, the models found by the GA 1) capture the expected overall rate patterns a priori; 2) fit the data better than the best available models based on a priori assumptions, suggesting subtle substitution patterns not previously recognized; 3) cannot be rejected in favor of the general reversible model, implying that the evolution of stem RNA sequences can be explained well with only a few substitution rate parameters; and 4) perform well on simulated data, both in terms of goodness of fit and the ability to estimate evolutionary rates. We also investigate the utility of several distance measures for comparing and contrasting inferred evolutionary models. Using widely available small computer clusters, our approach makes it possible, for the first time, to evaluate the performance of existing RNA evolutionary models by comparing them with a large pool of candidate models and to validate common modeling assumptions. In addition, the new method provides the foundation for rigorous selection and comparison of substitution models for other types of sequence data.


Pages: 159-170

Page count: 12
