Discovery of conserved sequence patterns using a stochastic dictionary model

Cited by: 40
Authors
Gupta, M [1]
Liu, JS [1]
Affiliations
[1] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Keywords
data augmentation; gene regulation; missing data; transcription factor binding site
DOI
10.1198/016214503388619094
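Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract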
Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missing-data approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.
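The dynamic programming-like recursion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a dictionary containing only single-nucleotide words and one motif word of fixed width, and the names `forward_likelihood`, `bg`, `pwm`, and `motif_prob` are hypothetical. The function sums, over every way of segmenting the sequence into words, the product of the word probabilities, without enumerating the segmentations.

```python
def forward_likelihood(seq, bg, pwm, motif_prob):
    """Sum over all segmentations of `seq` into words.

    Assumed (illustrative) setup, not taken from the paper:
      bg         -- dict of single-letter word probabilities, e.g. {"A": 0.25, ...}
      pwm        -- list of dicts, one per motif position (a probability weight matrix)
      motif_prob -- probability that the next word is the motif rather than a single letter
    """
    w = len(pwm)
    n = len(seq)
    # f[i] holds the summed probability of all segmentations of seq[:i].
    f = [0.0] * (n + 1)
    f[0] = 1.0
    for i in range(1, n + 1):
        # Case 1: the last word is the single letter seq[i-1].
        total = (1.0 - motif_prob) * bg[seq[i - 1]] * f[i - 1]
        # Case 2: the last word is a motif occupying seq[i-w:i].
        if i >= w:
            site = 1.0
            for j in range(w):
                site *= pwm[j][seq[i - w + j]]
            total += motif_prob * site * f[i - w]
        f[i] = total
    return f[n]


if __name__ == "__main__":
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # A toy width-2 motif that always emits "AC".
    pwm = [
        {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
        {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    ]
    print(forward_likelihood("GGACGT", bg, pwm, motif_prob=0.1))
```

The cost is proportional to the sequence length times the motif width, whereas the number of segmentations grows exponentially. For long sequences, a practical version would work in log space or rescale `f` to avoid underflow.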
Pages: 55-66
Number of pages: 12