Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment

Cited by: 59
Authors
Gracy, J [1]
Argos, P [1]
Affiliations
[1] European Mol Biol Lab, D-69012 Heidelberg, Germany
DOI
10.1093/bioinformatics/14.2.164
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Genome sequencing projects require the periodic application of analysis tools that can classify and multiply align related protein sequence domains. Full automation of this task requires an efficient integration of similarity and alignment techniques.
Results: We have developed a fully automated process that classifies entire protein sequence databases, resulting in alignment of the homologous sequences. The successive steps of the procedure are based on compositional and local sequence similarity searches followed by multiple sequence alignments. Global similarities are detected from the pairwise comparison of amino acid and dipeptide compositions of each protein. After the elimination of all but one sequence from each detected cluster of closely related proteins, the remaining sequences are compiled in a suffix tree, which is self-compared to detect local sequence similarities. Sets of proteins that share similar sequence segments are then weighted according to their closeness and multiply aligned using a fast hierarchical dynamic programming algorithm. Computational strategies were devised to minimize computer processing time and memory space requirements. The accuracy of the sequence classifications has been evaluated for 12 462 primary structures distributed over 341 known families. The percentage of sequences with missed or incorrect family assignments was 6.8% on the test set. This low error level is only twice that of the manually constructed PROSITE database (3.4%) and is substantially better than that found for the automatically built PRODOM database (34.9%).
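The compositional step described in the abstract (pairwise comparison of amino acid and dipeptide compositions, then keeping one sequence per cluster of close relatives) can be illustrated with a minimal sketch. The abstract does not give the distance measure, threshold, or clustering rule, so the Euclidean distance, the `threshold` value, the greedy selection, and all function names below are assumptions made for illustration only, not the authors' method.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_vector(seq):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = max(len(seq), 1)       # avoid division by zero on empty input
    n_di = max(len(seq) - 1, 1)
    vec = [mono[a] / n_mono for a in AMINO_ACIDS]
    vec += [di[a + b] / n_di for a, b in product(AMINO_ACIDS, repeat=2)]
    return vec


def composition_distance(seq_a, seq_b):
    """Euclidean distance between composition vectors (assumed measure)."""
    va, vb = composition_vector(seq_a), composition_vector(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


def pick_representatives(seqs, threshold=0.1):
    """Greedily keep a sequence only if it is farther than `threshold`
    from every representative kept so far (hypothetical redundancy filter)."""
    reps = []
    for s in seqs:
        if all(composition_distance(s, r) > threshold for r in reps):
            reps.append(s)
    return reps


if __name__ == "__main__":
    toy = ["ACDEFGHIK", "ACDEFGHIK", "WWWWYYYY"]
    print(pick_representatives(toy))
```

Composition vectors are cheap to compute and compare, which is presumably why such a step suits a fast first pass over a whole database before the more expensive local similarity search; the sketch makes no attempt to reproduce the paper's actual scoring or its time and memory optimizations.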
Pages: 164-173
Page count: 10