Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules

被引:21
作者
Boeva, Valentina [1 ,2 ]
Clement, Julien [3 ]
Regnier, Mireille [2 ]
Roytberg, Mikhail A. [4 ,5 ]
Makeev, Vsevolod J. [1 ,6 ]
机构
[1] GosNIIGenetika, Inst Genet & Select Ind Microorganisms, Moscow 117545, Russia
[2] INRIA Rocquencourt, MIGEC, F-78153 Le Chesnay, France
[3] CNRS, UMR 6072, GREYC, Lab Informat, F-14032 Caen, France
[4] Russian Acad Sci, Inst Math Problems Biol, Pushchino 142292, Moscow Region, Russia
[5] Puschino State Univ, Pushchino, Moscow Region, Russia
[6] Russian Acad Sci, Engelhardt Mol Biol Inst, Moscow, Russia
关键词
D O I
10.1186/1748-7188-2-13
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. Results: We developed and implemented an algorithm computing the p-value that s different motifs occur respectively k(1),...,k(s) or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. Method: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the O(n vertical bar Sigma vertical bar(m vertical bar H vertical bar + K vertical bar sigma vertical bar(K)) Pi(i)k(i)) time complexity, where n is the length of the text, vertical bar Sigma vertical bar is the alphabet size, m is the maximal motif length, vertical bar H vertical bar is the total number of words in motifs, K is the order of Markov model, and ki is the number of occurrences of the ith motif. Conclusion: The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can alsobe used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs.
引用
收藏
页数:15
相关论文
共 63 条
[21]   Finding motifs in promoter regions [J].
Hertzberg, L ;
Zuk, O ;
Getz, G ;
Domany, E .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2005, 12 (03) :314-330
[22]   Combinatorial motif analysis and hypothesis generation on a genomic scale [J].
Hu, YJ ;
Sandmeyer, S ;
McLaughlin, C ;
Kibler, D .
BIOINFORMATICS, 2000, 16 (03) :222-232
[23]   Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes [J].
Jegga, AG ;
Sherwood, SP ;
Carman, JW ;
Pinski, AT ;
Phillips, JL ;
Pestian, JP ;
Aronow, BJ .
GENOME RESEARCH, 2002, 12 (09) :1408-1417
[24]  
Jun S, 1996, DEVELOPMENT, V122, P2639
[25]   Identifying combinatorial regulation of transcription factors and binding motifs [J].
Kato, M ;
Hata, N ;
Banerjee, N ;
Futcher, B ;
Zhang, MQ .
GENOME BIOLOGY, 2004, 5 (08)
[26]   Detecting localized repeats in genomic sequences:: A new strategy and its application to Bacillus subtilis and Arabidopsis thaliana sequences [J].
Klaerr-Blanchard, M ;
Chiapello, H ;
Coward, E .
COMPUTERS & CHEMISTRY, 2000, 24 (01) :57-70
[27]  
Klingenhoff A., 2002, In Silico Biology, V2, pS17
[28]  
Krivan William, 2004, J Bioinform Comput Biol, V2, P413, DOI 10.1142/S021972000400065X
[29]  
Kucherov G, 2004, LECT NOTES COMPUT SC, V3109, P297
[30]   Identification of the binding sites of regulatory proteins in bacterial genomes [J].
Li, H ;
Rhodius, V ;
Gross, C ;
Siggia, ED .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (18) :11772-11777