AUTOMATIC GENERATION OF PRIMARY SEQUENCE PATTERNS FROM SETS OF RELATED PROTEIN SEQUENCES

Cited by: 268

Authors:

SMITH, RF [1]

SMITH, TF [1]

Affiliation:

[1] HARVARD UNIV,SCH PUBL HLTH,BOSTON,MA 02115

Source:

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA | 1990 / Vol. 87 / No. 1

Keywords:

amino acid; cluster analysis; dynamic programming; protein families; sequence comparison

DOI:

10.1073/pnas.87.1.118

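Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]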

Subject classification codes:

07; 0710; 09

Abstract:

We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common 'root' pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a 'pay once' gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequenced family membership than any of the individual sequences used to construct the pattern.
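Step (ii) of the abstract, building one common pattern from an aligned pair, can be illustrated with a minimal Python sketch. The class hierarchy below (`CLASSES`), the symbol names, and the functions `covering_symbol` and `pattern_from_alignment` are all hypothetical; the abstract does not list the actual amino acid classes the authors used, and the alignment itself (step i) is assumed to have been computed already.

```python
# Hypothetical nested hierarchy of amino acid classes (an assumption for
# illustration; the real classes are not given in the abstract).
# Lowercase letters and "X" name classes; other uppercase letters are residues.
CLASSES = {
    "c": set("DEKRH"),                 # charged (assumed grouping)
    "p": set("DEKRHNQST"),             # polar, contains class "c"
    "h": set("AVLIMFWYC"),             # hydrophobic (assumed grouping)
    "X": set("ACDEFGHIKLMNPQRSTVWY"),  # any of the 20 standard residues
}
GAP = "-"  # gap character: later treated as 0 or 1 residue of any type


def members(symbol):
    """Return the set of residues covered by a residue or class symbol."""
    return CLASSES.get(symbol, {symbol})


def covering_symbol(a, b):
    """Return the smallest symbol covering both aligned elements."""
    if a == GAP or b == GAP:
        return GAP  # a gapped position becomes a gap character
    needed = members(a) | members(b)
    if len(needed) == 1:
        return next(iter(needed))  # identical residues stay as the residue
    candidates = [name for name, residues in CLASSES.items() if needed <= residues]
    # Minimal inclusive class: the covering class with the fewest members.
    return min(candidates, key=lambda name: len(CLASSES[name]))


def pattern_from_alignment(row_a, row_b):
    """Collapse two equal-length aligned rows into one common pattern."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return "".join(covering_symbol(a, b) for a, b in zip(row_a, row_b))


print(pattern_from_alignment("KDLA-G", "RSLVHG"))
```

Because the output pattern uses the same symbols as its inputs, it can itself be fed back in as one row of a later alignment, which is what allows the tree to be reduced node by node toward a single root pattern.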

Pages: 118-122

Page count: 5

References (26 in total)

[1] Aho A. V., 1980, FORMAL LANGUAGE THEO, P325

[2] ARBARBANEL RM, 1984, NUCLEIC ACIDS RES, V12, P263

[3] DETERMINANTS OF A PROTEIN FOLD - UNIQUE FEATURES OF THE GLOBIN AMINO-ACID-SEQUENCES [J].