NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence

Cited by: 85
Authors
Down, TA [1]
Hubbard, TJP [1]
Affiliations
[1] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
Funding
Wellcome Trust (UK);
Keywords
DOI
10.1093/nar/gki282
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
NestedMICA is a new, scalable pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range of scenarios, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs that were good predictors for all five abundant classes of annotated binding sites.
Pages: 1445-1453
Page count: 9