Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments

被引:62
作者
Chang, MSS
Benner, SA [1 ]
机构
[1] Univ Florida, Dept Chem, Gainesville, FL 32611 USA
[2] Fdn Appl Mol Evolut, Gainesville, FL 32601 USA
关键词
insertions; deletions; protein evolution; alignments; gaps;
D O I
10.1016/j.jmb.2004.05.045
中图分类号
Q5 [生物化学]; Q7 [分子生物学];
学科分类号
071010 ; 081704 ;
摘要
To understand how protein segments are inserted and deleted during divergent evolution, a set of pairwise alignments contained exactly one gap, and therefore arising from the first insertion-deletion (indel) event in the time separating the homologs, was examined. The alignments showed that "structure breaking" amino acids (PGDNS) were preferred within and flanking gapped regions, as are two residues with hydrophilic side-chains (QE) that frequently occur at the surface of protein folds. Conversely, hydrophobic residues (FMILYVW) occur infrequently within and flanking the gapped region. These preferences are modestly different in protein pairs separated by an episode of adaptive evolution, than in pairs diverging under strong functional constraints. Surprisingly, regions near an indel have not evolved more rapidly than the sequence pair overall, showing no evidence that an indel event must be compensated by local amino acid replacement. The gap-lengths are best approximated by a Zipfian distribution, with the probability of a gap of length L decreasing as a function of L-1.8. These features are largely independent of the length of the gap and the extent of divergence (measured by both silent and non-silent sequence changes) separating the two proteins. Surprisingly, amino acid repeats were discovered in more than a third of the polypeptide segments in and around the gap. These correspond to repeats in the DNA sequence. This suggests that a signature of the mechanism by which indels occur in the DNA sequence remains in the encoded protein sequences. These data suggest specific tools to score gap placement in an alignment. They also suggest tools that distinguish true indels from gaps created by mistaken gene finding, including under-predicted and overpredicted introns. By providing mechanisms to identify errors, the tools will enhance the value of genome sequence databases in support of integrated paleogenomics strategies used to extract functional information in a post-genomic environment. (C) 2004 Elsevier Ltd. All rights reserved.
引用
收藏
页码:617 / 631
页数:15
相关论文
共 41 条