Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis

Cited by: 162
Authors
Bussemaker, HJ [1]
Li, H [1]
Siggia, ED [1]
Affiliations
[1] Rockefeller Univ, Ctr Studies Phys & Biol, New York, NY 10021 USA
DOI
10.1073/pnas.180265397
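Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification codes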
07 ; 0710 ; 09 ;
Abstract
The availability of complete genome sequences and mRNA expression data for all genes creates new opportunities and challenges for identifying DNA sequence motifs that control gene expression. An algorithm, "MobyDick," is presented that decomposes a set of DNA sequences into the most probable dictionary of motifs or words. This method is applicable to any set of DNA sequences: for example, all upstream regions in a genome or all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter ones of various lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally determined sites (a significance level of 18 standard deviations). When analyzing all of the genes up-regulated during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the expression subclusters individually. Applying MobyDick to the genes derepressed when the general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners.
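The probabilistic segmentation idea in the abstract can be illustrated with a small sketch. It assumes a fixed toy dictionary with made-up word probabilities (`toy_probs` is hypothetical, not taken from the paper) and only evaluates that dictionary: it sums over every way of cutting a sequence into dictionary words, and also recovers the single most probable cut. It does not attempt the fitting of word probabilities or the dictionary-building steps, and it is not the authors' implementation.

```python
from typing import Dict, List


def segmentation_likelihood(seq: str, word_probs: Dict[str, float]) -> float:
    """Sum, over all ways to cut seq into dictionary words,
    of the product of the word probabilities."""
    max_len = max(len(w) for w in word_probs)
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(seq) + 1):
        for length in range(1, min(max_len, end) + 1):
            p = word_probs.get(seq[end - length:end])
            if p:
                z[end] += p * z[end - length]
    return z[len(seq)]


def best_segmentation(seq: str, word_probs: Dict[str, float]) -> List[str]:
    """Most probable single segmentation (same recursion, max instead of sum)."""
    max_len = max(len(w) for w in word_probs)
    n = len(seq)
    best = [0.0] * (n + 1)
    back = [0] * (n + 1)  # length of the last word in the best cut ending here
    best[0] = 1.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            p = word_probs.get(seq[end - length:end])
            if p and p * best[end - length] > best[end]:
                best[end] = p * best[end - length]
                back[end] = length
    if best[n] == 0.0:
        raise ValueError("sequence cannot be segmented with this dictionary")
    words = []
    pos = n
    while pos > 0:
        words.append(seq[pos - back[pos]:pos])
        pos -= back[pos]
    return words[::-1]


# Hypothetical toy dictionary: four single letters plus one longer word.
toy_probs = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "ACGT": 0.2}

if __name__ == "__main__":
    print(segmentation_likelihood("ACGTA", toy_probs))
    print(best_segmentation("ACGTA", toy_probs))
```

In this toy case the cut that uses the longer word dominates the sum, which gives a rough sense of how a longer word can be judged against the single-letter background without a separate reference data set.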
Pages: 10096-10100
Page count: 5
Related papers
23 in total
  • [1] BAILEY TL, 1995, MACH LEARN, V21, P51, DOI 10.1007/BF00993379
  • [2] Predicting gene regulatory elements in silico on a genomic scale
    Brazma, A
    Jonassen, I
    Vilo, J
    Ukkonen, E
    [J]. GENOME RESEARCH, 1998, 8 (11) : 1202 - 1215
  • [3] CHU CH, 1998, SCIENCE, V279, P1896
  • [4] The transcriptional program of sporulation in budding yeast
    Chu, S
    DeRisi, J
    Eisen, M
    Mulholland, J
    Botstein, D
    Brown, PO
    Herskowitz, I
    [J]. SCIENCE, 1998, 282 (5389) : 699 - 705
  • [5] Deckert J, 1998, GENETICS, V150, P1429
  • [6] Exploring the metabolic and genetic control of gene expression on a genomic scale
    DeRisi, JL
    Iyer, VR
    Brown, PO
    [J]. SCIENCE, 1997, 278 (5338) : 680 - 686
  • [7] Durbin R., 1998, BIOL SEQUENCE ANAL
  • [8] Cluster analysis and display of genome-wide expression patterns
    Eisen, MB
    Spellman, PT
    Brown, PO
    Botstein, D
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1998, 95 (25) : 14863 - 14868
  • [9] Gailus-Durner V, 1996, MOL CELL BIOL, V16, P2777
  • [10] Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment
    Lawrence, CE
    Altschul, SF
    Boguski, MS
    Liu, JS
    Neuwald, AF
    Wootton, JC
    [J]. SCIENCE, 1993, 262 (5131) : 208 - 214