Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities

Cited by: 43
Authors
Gracy, J [1]
Argos, P [1]
Affiliations
[1] European Mol Biol Lab, D-69012 Heidelberg, Germany
DOI
10.1093/bioinformatics/14.2.174
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704
Abstract
Motivation: Decomposing each protein into modular domains is a basic prerequisite for accurately classifying structural units in biological molecules. Boundaries between domains are indicated by two similar amino acid sequence segments located within the same protein (repeats) or within homologous proteins at notably different distances from their respective N- or C-termini. Results: We have developed an automated method that combines such positional constraints, derived from various detected pairwise sequence similarities, to delineate the modular organization of proteins. The procedure has been applied to a non-redundant data set of 26 990 proteins whose sequences were taken from the PIR and SWISS-PROT databanks and shared <60% sequence identity amongst pairs. The resultant clustering, delineation and multiple alignment of 24 380 sequence fragments yielded a new database of 4364 domain families. Comparison of the domain collection with that of PRODOM indicates a clear improvement in the number and size of domain families, domain boundaries and multiple sequence alignments. The accuracy and sensitivity of the method are illustrated by results obtained for ankyrin-like repeats and EGF-like modules.
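The positional constraint described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure, which combines constraints from many pairwise similarities; it only shows how a single similarity could suggest a boundary when the two matched segments lie at notably different distances from the N- or C-termini. The `Hit` record, the function `candidate_boundaries` and the `min_shift` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise local similarity (hypothetical record).

    Coordinates are 1-based and inclusive. prot_a and prot_b may be the
    same protein, in which case the hit describes an internal repeat.
    """
    prot_a: str
    start_a: int
    end_a: int
    len_a: int
    prot_b: str
    start_b: int
    end_b: int
    len_b: int


def candidate_boundaries(hit, min_shift=30):
    """Return a set of (protein, position, side) boundary candidates.

    min_shift is an assumed threshold for "notably different" distances.
    """
    out = set()
    # Residues preceding each matched segment (distance to the N-terminus).
    n_a, n_b = hit.start_a - 1, hit.start_b - 1
    # Residues following each matched segment (distance to the C-terminus).
    c_a, c_b = hit.len_a - hit.end_a, hit.len_b - hit.end_b

    # The segment lying further from the N-terminus has extra sequence in
    # front of it, so its start is a candidate domain boundary.
    if n_a - n_b >= min_shift:
        out.add((hit.prot_a, hit.start_a, "start"))
    elif n_b - n_a >= min_shift:
        out.add((hit.prot_b, hit.start_b, "start"))

    # The same reasoning applies at the C-terminal side.
    if c_a - c_b >= min_shift:
        out.add((hit.prot_a, hit.end_a, "end"))
    elif c_b - c_a >= min_shift:
        out.add((hit.prot_b, hit.end_b, "end"))
    return out


# Two tandem repeats inside one 60-residue protein.
repeat = Hit("P1", 1, 30, 60, "P1", 31, 60, 60)
# A 100-residue protein matching the second half of a 200-residue homolog.
homolog = Hit("A", 101, 200, 200, "B", 1, 100, 100)
```

For the tandem repeat, `candidate_boundaries(repeat, min_shift=20)` proposes a boundary between residues 30 and 31; for the homolog pair, the default threshold proposes a boundary at residue 101 of protein `A`. A full method would still have to reconcile candidates from many such hits.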
Pages: 174-187
Page count: 14