Finding motifs using random projections

被引:228
作者
Buhler, J
Tompa, M
机构
[1] Washington Univ, Dept Comp Sci, St Louis, MO 63130 USA
[2] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
关键词
motif finding; random projection; regulatory sequences;
D O I
10.1089/10665270252935430
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
The DNA motif discovery problem abstracts the task of discovering short, conserved sites in genomic DNA. Pevzner and Sze recently described a precise combinatorial formulation of motif discovery that motivates the following algorithmic challenge: find twenty planted occurrences of a motif of length fifteen in roughly twelve kilobases of genomic sequence, where each occurrence of the motif differs from its consensus in four randomly chosen positions. Such "subtle" motifs, though statistically highly significant, expose a weakness in existing motif-finding algorithms, which typically fail to discover them. Pevzner and Sze introduced new algorithms to solve their (15,4)-motif challenge, but these methods do not scale efficiently to more difficult problems in the same family, such as the (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif-discovery algorithm, PROJECTION, designed to enhance the performance of existing motif finders using random projections of the input's substrings. Experiments on synthetic data demonstrate that PROJECTION remedies the weakness observed in existing algorithms, typically solving the difficult (14,4)-, (16,5)-, and (18,6)-motif problems. Our algorithm is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge. A probabilistic estimate suggests that related motif-finding problems that PROJECTION fails to solve are in all likelihood inherently intractable. We also test the performance of our algorithm on realistic biological examples, including transcription factor binding sites in eukaryotes and ribosome binding sites in prokaryotes.
引用
收藏
页码:225 / 242
页数:18
相关论文
共 40 条
  • [1] METAL-DEPENDENT BINDING OF A FACTOR INVIVO TO THE METAL-RESPONSIVE ELEMENTS OF THE METALLOTHIONEIN-1 GENE PROMOTER
    ANDERSEN, RD
    TAPLITZ, SJ
    WONG, S
    BRISTOL, G
    LARKIN, B
    HERSCHMAN, HR
    [J]. MOLECULAR AND CELLULAR BIOLOGY, 1987, 7 (10) : 3574 - 3581
  • [2] [Anonymous], GENES
  • [3] BAILEY TL, 1995, MACH LEARN, V21, P51, DOI 10.1007/BF00993379
  • [4] Blanchette M., 2001, P 5 ANN INT C COMP M, P49
  • [5] BLANCHETTE M, 2002, IN PRESS J COMP BIOL
  • [6] BOAM DSW, 1990, J BIOL CHEM, V265, P8285
  • [7] ON LIPSCHITZ EMBEDDING OF FINITE METRIC-SPACES IN HILBERT-SPACE
    BOURGAIN, J
    [J]. ISRAEL JOURNAL OF MATHEMATICS, 1985, 52 (1-2) : 46 - 52
  • [8] Predicting gene regulatory elements in silico on a genomic scale
    Brazma, A
    Jonassen, I
    Vilo, J
    Ukkonen, E
    [J]. GENOME RESEARCH, 1998, 8 (11) : 1202 - 1215
  • [9] Efficient large-scale sequence comparison by locality-sensitive hashing
    Buhler, J
    [J]. BIOINFORMATICS, 2001, 17 (05) : 419 - 428
  • [10] EXPECTATION MAXIMIZATION ALGORITHM FOR IDENTIFYING PROTEIN-BINDING SITES WITH VARIABLE LENGTHS FROM UNALIGNED DNA FRAGMENTS
    CARDON, LR
    STORMO, GD
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1992, 223 (01) : 159 - 170