New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification

Cited by: 80

Authors:

Wojcik, J

Mornon, JP

Chomilier, J

Affiliations:

[1] Univ Paris 06, Lab Mineral Cristallog, F-75252 Paris 05, France

[2] Univ Paris 07, CNRS UMR 7590, F-75252 Paris 05, France

[3] Fac Med Necker Enfants Malad, INSERM, U344, F-75015 Paris, France

Source:

JOURNAL OF MOLECULAR BIOLOGY | 1999 / Vol. 289 / Issue 05

Keywords:

proteins; loops; structure; modeling; database;

DOI:

10.1006/jmbi.1999.2826

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

A bank of 13,563 loops from three to eight amino acid residues long, representing motifs between two consecutive regular secondary structures, has been derived from protein structures presenting less than 95% sequence identity. Statistical analyses of occurrences of conformations and residues revealed length-dependent over-representations of particular amino acids (glycine, proline, asparagine, serine, and aspartate) and conformations (α_L, ε, β_P regions of the Ramachandran plot). A position-dependent distribution of these occurrences was observed for N and C-terminal residues, which are correlated to the nature of the flanking regions. Loops of the same length were clustered into statistically meaningful families on the basis of their backbone structures when placed in a common reference frame, independent of the flanks. These clusters present significantly different distributions of sequence, conformations, and endpoint residue Cα distances. On the basis of the sequence-structure correlation of this clustering, an automatic loop modeling algorithm was developed. Based on the knowledge of its sequence and of its flank backbone structures, each query loop is assigned to a family and target loop supports are selected in this family. The support backbones of these target loops are then adjusted on flanking structures by partial exploration of the conformational space. Loop closure is performed by energy minimization for each support and the final model is chosen among connected supports based upon energy criteria. The quality of the prediction is evaluated by the root-mean-square deviation (rmsd) between the final model and the native loops when the whole bank is re-attributed on itself with a Jackknife test. This average rmsd ranges from 1.1 Å for three-residue loops to 3.8 Å for eight-residue loops. A few poorly predicted loops are inescapable, considering the high level of diversity in loops and the lack of environment data. To overcome such modeling problems, a statistical reliability score was assigned for each prediction. This score is correlated to the quality of the prediction, in terms of rmsd, and thus improves the selection accuracy of the model. The algorithm efficiency was compared to CASP3 target loop predictions. Moreover, when tested on a test loop bank, this algorithm was shown to be robust when the loops are not precisely delimited, therefore proving to be a useful tool in practice for protein modeling. © 1999 Academic Press.
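The rmsd-based Jackknife evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which atoms enter the rmsd or whether a superposition step is applied, so the code below assumes both coordinate sets are already expressed in a common reference frame. The names `rmsd`, `jackknife_mean_rmsd`, and the `predict` callback are hypothetical and introduced only for illustration.

```python
import math


def rmsd(model, native):
    """Root-mean-square deviation between two equally sized lists of (x, y, z).

    Assumption: both coordinate sets are already in the same reference frame;
    no superposition is performed here.
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, native):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))


def jackknife_mean_rmsd(bank, predict):
    """Leave-one-out evaluation over a bank of (query, native_coords) pairs.

    `predict(query, rest)` is a hypothetical callback that returns model
    coordinates for `query` using only the remaining bank entries `rest`.
    """
    scores = []
    for i, (query, native) in enumerate(bank):
        rest = bank[:i] + bank[i + 1:]  # withhold the loop being predicted
        scores.append(rmsd(predict(query, rest), native))
    return sum(scores) / len(scores)
```

In this sketch each loop is predicted with itself withheld from the bank, and the per-loop rmsd values are averaged. Grouping the bank by loop length before averaging would give one mean rmsd per length, which is how the abstract reports its figures.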


Pages: 1469-1490

Page count: 22
