RecMotif: a novel fast algorithm for weak motif discovery

被引:17
作者
Sun, He Quan [1 ]
Low, Malcolm Yoke Hean [1 ]
Hsu, Wen Jing [1 ]
Rajapakse, Jagath C. [1 ]
机构
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore
来源
BMC BIOINFORMATICS | 2010年 / 11卷
关键词
FACTOR-BINDING SITES; REGULATORY MOTIFS; FINDING MOTIFS; SEQUENCES; DNA; IDENTIFICATION; DICTIONARY; PATTERNS;
D O I
10.1186/1471-2105-11-S11-S8
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: Weak motif discovery in DNA sequences is an important but unresolved problem in computational biology. Previous algorithms that aimed to solve the problem usually require a large amount of memory or execution time. In this paper, we proposed a fast and memory efficient algorithm, RecMotif, which guarantees to discover all motifs with specific (l, d) settings (where l is the motif length and d is the maximum number of mutations between a motif instance and the true motif). Results: Comparisons with several recently proposed algorithms have shown that RecMotif is more scalable for handling longer and weaker motifs. For instance, it can solve the open challenge cases such as (40, 14) within 5 hours while the other algorithms compared failed due to either longer execution times or shortage of memory space. For real biological sequences, such as E. coli CRP, RecMotif is able to accurately discover the motif instances with (l, d) as (18, 6) in less than 1 second, which is faster than the other algorithms compared. Conclusions: RecMotif is a novel algorithm that requires only a space complexity of O(m(2)n) (where m is the number of sequences in the data and n is the length of the sequences).
引用
收藏
页数:11
相关论文
共 26 条
[1]  
Bailey TL., 1994, Proc Int Conf Intel Syst Mol Biol, V2, P28
[2]   Finding motifs using random projections [J].
Buhler, J ;
Tompa, M .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2002, 9 (02) :225-242
[3]   Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis [J].
Bussemaker, HJ ;
Li, H ;
Siggia, ED .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2000, 97 (18) :10096-10100
[4]  
DAVILA J, 2006, 6 INT C COMP SCI ICC, P822
[5]   Fast and practical algorithms for planted (l, d) motif search [J].
Davila, Jaime ;
Balla, Sudha ;
Rajasekaran, Sanguthevar .
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2007, 4 (04) :544-552
[6]   MotifCut: regulatory motifs finding with maximum density subgraphs [J].
Fratkin, Eugene ;
Naughton, Brian T. ;
Brutlag, Douglas L. ;
Batzoglou, Serafim .
BIOINFORMATICS, 2006, 22 (14) :E150-E157
[7]   Identifying DNA and protein patterns with statistically significant alignments of multiple sequences [J].
Hertz, GZ ;
Stormo, GD .
BIOINFORMATICS, 1999, 15 (7-8) :563-577
[8]  
HERTZ GZ, 1990, COMPUT APPL BIOSCI, V6, P81
[9]   iTriplet, a rule-based nucleic acid sequence motif finder [J].
Ho, Eric S. ;
Jakubowski, Christopher D. ;
Gunderson, Samuel I. .
ALGORITHMS FOR MOLECULAR BIOLOGY, 2009, 4
[10]   Finding motifs in the twilight zone [J].
Keich, U ;
Pevzner, PA .
BIOINFORMATICS, 2002, 18 (10) :1374-1381