Phylogenetic assessment of alignments reveals neglected tree signal in gaps

Cited by: 92
Authors
Dessimoz, Christophe [1, 2]
Gil, Manuel [1, 2]
Affiliations
[1] ETH, Dept Comp Sci, CH-8092 Zurich, Switzerland
[2] Swiss Inst Bioinformat, CH-8092 Zurich, Switzerland
Source
GENOME BIOLOGY | 2010 / Vol. 11 / Issue 04
Keywords
MULTIPLE SEQUENCE ALIGNMENT; MAXIMUM-LIKELIHOOD; ALGORITHM; ACCURATE; PROTEIN; IMPROVEMENT; INFERENCE; MODELS; INFORMATION; DIVERGENT;
DOI
10.1186/gb-2010-11-4-r37
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism. Results: Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees. Conclusions: This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.
Pages: 9