AUTOMATIC GENERATION OF PRIMARY SEQUENCE PATTERNS FROM SETS OF RELATED PROTEIN SEQUENCES

Citations: 268
Authors
SMITH, RF [1]
SMITH, TF [1]
Affiliation
[1] HARVARD UNIV,SCH PUBL HLTH,BOSTON,MA 02115
Keywords
amino acid; cluster analysis; dynamic programming; protein families; sequence comparison;
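DOI
10.1073/pnas.87.1.118
Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09
Abstract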
We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common 'root' pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a 'pay once' gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence database. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
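Step (ii) of the abstract, building one common pattern from an aligned pair, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class hierarchy in `CLASSES`, the helper names (`minimal_class`, `merge_alignment`, `to_elements`, `render`), and the bracket notation are assumptions made for illustration; the abstract does not list the amino acid classes actually used. The dynamic programming alignment of step (i) and the 'pay once' gap penalty are omitted, so the two input rows are assumed to be already aligned and of equal length.

```python
# Illustrative sketch only: the class hierarchy below is hypothetical and is
# NOT the hierarchy used in the paper. Classes are nested (no partial overlaps),
# so the smallest class covering any set of residues is unique.
GAP = "-"
ANY = frozenset("ACDEFGHIKLMNPQRSTVWY")
CLASSES = [
    frozenset("DE"),         # acidic
    frozenset("KRH"),        # basic
    frozenset("DEKRH"),      # charged
    frozenset("ST"),         # hydroxyl
    frozenset("NQ"),         # amide
    frozenset("STNQ"),       # polar, uncharged
    frozenset("DEKRHSTNQ"),  # polar
    frozenset("ILV"),        # aliphatic
    frozenset("FWY"),        # aromatic
    frozenset("ILVFWYMAC"),  # hydrophobic
    ANY,                     # any amino acid
]


def minimal_class(a, b):
    """Return the smallest class covering both pattern elements.

    Each element is a frozenset of residues: a singleton for a plain
    residue, or a class produced by an earlier merge.
    """
    union = a | b
    if len(union) == 1:
        return union  # identical residues stay as they are
    return min((c for c in CLASSES if union <= c), key=len)


def merge_alignment(row_a, row_b):
    """Build one common pattern from two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pattern = []
    for x, y in zip(row_a, row_b):
        if x == GAP or y == GAP:
            pattern.append(GAP)  # gapped positions become gap characters
        else:
            pattern.append(minimal_class(x, y))
    return pattern


def to_elements(aligned):
    """Convert an aligned string into a list of pattern elements."""
    return [GAP if ch == GAP else frozenset(ch) for ch in aligned]


def render(pattern):
    """Render a pattern with classes in brackets, e.g. [DE]K[ST]-[ILV]."""
    parts = []
    for e in pattern:
        if e == GAP:
            parts.append(GAP)
        elif len(e) == 1:
            parts.append(next(iter(e)))
        else:
            parts.append("[" + "".join(sorted(e)) + "]")
    return "".join(parts)


if __name__ == "__main__":
    merged = merge_alignment(to_elements("DKS-L"), to_elements("EKTAV"))
    print(render(merged))
```

Because a merged pattern is itself a list of elements, it can be passed back into `merge_alignment` with another aligned row, which mirrors how the tree is reduced node by node toward a single root pattern.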
Pages: 118-122
Page count: 5