SPLASH: structural pattern localization analysis by sequential histograms

Cited by: 75
Authors
Califano, A [1]
Affiliations
[1] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
DOI
10.1093/bioinformatics/16.4.341
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Subject classification code
071010; 081704;
Abstract
Motivation: The discovery of sparse amino acid patterns that match repeatedly in a set of protein sequences is an important problem in computational biology. Statistically significant patterns, that is, patterns that occur more frequently than expected, may identify regions that have been preserved by evolution and may therefore play a key functional or structural role. Sparseness can be important because a handful of non-contiguous residues may play a key role, while the residues in between may change without significant loss of function or structure. Similar arguments may be applied to conserved DNA patterns. Available sparse pattern discovery algorithms are either inefficient or impose limitations on the type of patterns that can be discovered.

Results: This paper introduces a deterministic pattern discovery algorithm, called Splash, which can find sparse amino or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss of performance. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily with low overall homology can be analyzed to discover common functional or structural signatures. Some examples of biologically interesting motifs discovered by Splash are reported for the histone I and G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. These regions can then be used to build accurate and sensitive PSSM or HMM models for sequence analysis.
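To make the notion of a sparse pattern concrete, the following is a minimal Python sketch of what it means for such a pattern to match identically in a set of sequences. It is not the Splash algorithm, which the abstract does not describe in detail. The `.` wildcard notation, the function names (`matches_at`, `find_matches`, `support`), and the toy sequences are all assumptions made for illustration only.

```python
from typing import List, Tuple

WILDCARD = "."  # assumed notation for a position where any residue is allowed


def matches_at(pattern: str, sequence: str, offset: int) -> bool:
    """Return True if every non-wildcard pattern position equals the
    residue at the corresponding position of the sequence."""
    if offset < 0 or offset + len(pattern) > len(sequence):
        return False
    return all(
        p == WILDCARD or p == sequence[offset + i]
        for i, p in enumerate(pattern)
    )


def find_matches(pattern: str, sequences: List[str]) -> List[Tuple[int, int]]:
    """List every (sequence index, offset) at which the pattern matches."""
    return [
        (s, o)
        for s, seq in enumerate(sequences)
        for o in range(len(seq) - len(pattern) + 1)
        if matches_at(pattern, seq, o)
    ]


def support(pattern: str, sequences: List[str]) -> int:
    """Count the distinct sequences that contain at least one match."""
    return len({s for s, _ in find_matches(pattern, sequences)})


# Hypothetical toy sequences, not real protein data.
toy_sequences = ["AACKRCTTGH", "GCLMCAAAHG", "CAACAAAW"]
hits = find_matches("C..C...H", toy_sequences)
print(hits)                                  # [(0, 2), (1, 1)]
print(support("C..C...H", toy_sequences))    # 2
```

Only three residues are fixed in `C..C...H`; the five positions between them are free to vary, which is the sense in which the pattern is sparse. A discovery algorithm has the harder task of finding such patterns without being given them, and of judging whether their match counts exceed what would be expected by chance.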
Pages: 341-357
Page count: 17
Related papers
24 in total
  • [1] Methods and statistics for combining motif match scores
    Bailey, TL
    Gribskov, M
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (02) : 211 - 221
  • [2] PROSITE - A DICTIONARY OF SITES AND PATTERNS IN PROTEINS
    BAIROCH, A
    [J]. NUCLEIC ACIDS RESEARCH, 1991, 19 : 2241 - 2245
  • [3] PROTEIN DATA BANK - COMPUTER-BASED ARCHIVAL FILE FOR MACROMOLECULAR STRUCTURES
    BERNSTEIN, FC
    KOETZLE, TF
    WILLIAMS, GJB
    MEYER, EF
    BRICE, MD
    RODGERS, JR
    KENNARD, O
    SHIMANOUCHI, T
    TASUMI, M
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1977, 112 (03) : 535 - 542
  • [4] Approaches to the automatic discovery of patterns in biosequences
    Brazma, A
    Jonassen, I
    Eidhammer, I
    Gilbert, D
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (02) : 279 - 305
  • [5] Prediction of local structure in proteins using a library of sequence-structure motifs
    Bystroff, C
    Baker, D
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1998, 281 (03) : 565 - 577
  • [6] Califano A, 1993, Proc Int Conf Intell Syst Mol Biol, V1, P56
  • [7] HOMONUCLEAR AND HETERONUCLEAR 2-DIMENSIONAL NMR-STUDIES OF THE GLOBULAR DOMAIN OF HISTONE-H1 - SEQUENTIAL ASSIGNMENT AND SECONDARY STRUCTURE
    CERF, C
    LIPPENS, G
    MUYLDERMANS, S
    SEGERS, A
    RAMAKRISHNAN, V
    WODAK, SJ
    HALLENGA, K
    WYNS, L
    [J]. BIOCHEMISTRY, 1993, 32 (42) : 11345 - 11351
  • [8] Dayhoff M., 1978, ATLAS PROTEIN SEQ ST, V5, P353
  • [9] Dayhoff M.O., 1978, Atlas of Protein Sequence and Structure, P345
  • [10] GERRETSEN JCH, 1962, LECT TENSOR CALCULUS