Bounded search for de novo identification of degenerate cis-regulatory elements

被引：18

作者：

Carlson, Jonathan M.

Chakravarty, Arijit

Khetani, Radhika S.

Gross, Robert H. ^{[1
]}

机构：

[1] Dartmouth Coll, Dept Biol, Hanover, NH 03755 USA

[2] Millennium Pharmaceut Inc, Dept Canc Pharmacol, Cambridge, MA 02138 USA

[3] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98105 USA

来源：

BMC BIOINFORMATICS | 2006年 / 7卷 / 1期

基金：

美国国家科学基金会;

关键词：

D O I：

10.1186/1471-2105-7-254

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Background: The identification of statistically overrepresented sequences in the upstream regions of coregulated genes should theoretically permit the identification of potential cisregulatory elements. However, in practice many cis-regulatory elements are highly degenerate, precluding the use of an exhaustive word-counting strategy for their identification. While numerous methods exist for inferring base distributions using a position weight matrix, recent studies suggest that the independence assumptions inherent in the model, as well as the inability to reach a global optimum, limit this approach. Results: In this paper, we report PRISM, a degenerate motif finder that leverages the relationship between the statistical significance of a set of binding sites and that of the individual binding sites. PRISM first identifies overrepresented, non-degenerate consensus motifs, then iteratively relaxes each one into a high-scoring degenerate motif. This approach requires no tunable parameters, thereby lending itself to unbiased performance comparisons. We therefore compare PRISM's performance against nine popular motif finders on 28 well-characterized S. cerevisiae regulons. PRISM consistently outperforms all other programs. Finally, we use PRISM to predict the binding sites of uncharacterized regulons. Our results support a proposed mechanism of action for the yeast cell-cycle transcription factor StbI, whose binding site has not been determined experimentally. Conclusion: The relationship between statistical measures of the binding sites and the set as a whole leads to a simple means of identifying the diverse range of cis-regulatory elements to which a protein binds. This approach leverages the advantages of word-counting, in that position dependencies are implicitly accounted for and local optima are more easily avoided. While we sacrifice guaranteed optimality to prevent the exponential blowup of exhaustive search, we prove that the error is bounded and experimentally show that the performance is superior to other methods. A Java implementation of this algorithm can be downloaded from our web server at http://genie.dartmouth.edu/prism.

引用

页数：17

共 43 条

[1]

[Anonymous], 1997, ALGORITHMS STRINGS T, DOI DOI 10.1017/CBO9780511574931

[2]

BAILEY TL, 1995, MACH LEARN, V21, P51, DOI 10.1007/BF00993379

[3]

BARASH Y, 2003, RECOMB03 P 7 INT C C

[4] Is there a code for protein-DNA recognition? Probab(ilistical)ly ... [J].