Classifying molecular sequences using a linkage graph with their pairwise similarities

被引:70
作者
Matsuda, H [1 ]
Ishihara, T [1 ]
Hashimoto, A [1 ]
机构
[1] Osaka Univ, Grad Sch Engn Sci, Dept Informat & Math Sci, Osaka 5608531, Japan
关键词
protein sequence classification; graph covering; clique problem;
D O I
10.1016/S0304-3975(98)00091-7
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
This paper presents a method for classifying a large and mixed set of uncharacterized sequences provided by genome projects. As the measure of sequence similarity, we use similarity score computed by a method based on the dynamic programming (DP), such as the Smith-Waterman local alignment algorithm. Although comparison by DP based method is very sensitive, when given sequences include a family of sequences that are much diverged in evolutionary process, similarity among some of them may be hidden behind spurious similarity of some unrelated sequences. Also the distance derived from the similarity score may not be metric (i.e., triangle inequality may not hold) when some sequences have multi-domain structure. To cope with these problems, we introduce a new graph structure called p-quasi complete graph for describing a family of sequences with a confidence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, whether a graph contains a 0.5-quasi complete subgraph of which size greater than or equal to k or not) is NP-complete. Thus we present an approximation algorithm for classifying a set of sequences using p-quasi complete subgraphs. The effectiveness of our method is demonstrated by the result of classifying over 4000 protein sequences on the Escherichia coli genome that was completely determined recently. (C) 1999-Elsevier Science B.V. All rights reserved.
引用
收藏
页码:305 / 325
页数:21
相关论文
共 29 条
  • [1] Aiba H, 1996, DNA Res, V3, P363, DOI 10.1093/dnares/3.6.363
  • [2] BASIC LOCAL ALIGNMENT SEARCH TOOL
    ALTSCHUL, SF
    GISH, W
    MILLER, W
    MYERS, EW
    LIPMAN, DJ
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1990, 215 (03) : 403 - 410
  • [3] [Anonymous], 2010, Dynamic programming
  • [4] [Anonymous], J MOL BIOL
  • [5] [Anonymous], 1979, Computers and Intractablity: A Guide to the Theoryof NP-Completeness
  • [6] [Anonymous], 1978, Atlas of protein sequence and structure
  • [7] The SWISS-PROT protein sequence data bank and its supplement TrEMBL
    Bairoch, A
    Apweller, R
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (01) : 31 - 36
  • [8] The PROSITE database, its status in 1997
    Bairoch, A
    Bucher, P
    Hofmann, K
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (01) : 217 - 221
  • [9] AN ATPASE DOMAIN COMMON TO PROKARYOTIC CELL-CYCLE PROTEINS, SUGAR KINASES, ACTIN, AND HSP70 HEAT-SHOCK PROTEINS
    BORK, P
    SANDER, C
    VALENCIA, A
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1992, 89 (16) : 7290 - 7294
  • [10] GENMARK - PARALLEL GENE RECOGNITION FOR BOTH DNA STRANDS
    BORODOVSKY, M
    MCININCH, J
    [J]. COMPUTERS & CHEMISTRY, 1993, 17 (02): : 123 - 133