iTriplet, a rule-based nucleic acid sequence motif finder

Cited by: 24
Authors
Ho, Eric S. [1]
Jakubowski, Christopher D. [1]
Gunderson, Samuel I. [1]
Affiliations
[1] Rutgers State Univ, Dept Mol Biol & Biochem, Nelson Labs, Piscataway, NJ 08854 USA
Funding
National Science Foundation (USA);
Keywords
BIOLOGICAL SEQUENCES; DOWNSTREAM ELEMENTS; GENOME BROWSER; DISCOVERY; POLYADENYLATION; ALGORITHM;
DOI
10.1186/1748-7188-4-14
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Background: With the advent of high-throughput sequencing techniques, large amounts of sequencing data are readily available for analysis. Natural biological signals are intrinsically highly variable, making their complete identification a computationally challenging problem. Many statistical and combinatorial approaches have been applied with great success in the past. However, identifying highly degenerate and long (>20 nucleotides) motifs remains an unmet challenge, because high degeneracy diminishes the statistical significance of biological signals and increasing motif size causes combinatorial explosion. In this report, we present a novel rule-based method that focuses on finding degenerate and long motifs. Our proposed method, named iTriplet, avoids the costly enumeration present in existing combinatorial methods and is amenable to parallel processing.
Results: We have conducted a comprehensive assessment of the performance and sensitivity-specificity of iTriplet in analyzing artificial and real biological sequences in various genomic regions. The results show that iTriplet is able to solve challenging cases. Furthermore, we have confirmed the utility of iTriplet by showing, with a dual Luciferase reporter assay, that it accurately predicts polyA-site-related motifs.
Conclusion: iTriplet is a novel rule-based combinatorial or enumerative motif finding method that is able to process highly degenerate and long motifs that have resisted analysis by other methods. In addition, iTriplet is distinguished from other methods of the same family by its parallelizability, which allows it to leverage the power of today's readily available high-performance computing systems.
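The combinatorial explosion mentioned in the Background can be made concrete with a little arithmetic. The sketch below is not iTriplet and is not taken from the paper; it assumes the common "motif of length l with up to d mismatches" (Hamming distance) model, and the function names and the (l, d) pairs are hypothetical choices for illustration. It counts how many candidate strings of length l exist and how many strings lie within d mismatches of any one of them, which is the space a naive enumerative search would have to cover.

```python
from math import comb


def num_lmers(l: int) -> int:
    """Number of distinct nucleotide strings of length l over {A, C, G, T}."""
    return 4 ** l


def neighborhood_size(l: int, d: int) -> int:
    """Number of length-l strings within Hamming distance d of a fixed string.

    Choose i mismatch positions (comb(l, i)); each can take 3 other bases.
    """
    return sum(comb(l, i) * 3 ** i for i in range(d + 1))


if __name__ == "__main__":
    # Illustrative (l, d) pairs only; these are assumptions, not values from the paper.
    for l, d in [(10, 2), (15, 4), (20, 6), (25, 8)]:
        print(
            f"l={l:2d} d={d}: "
            f"{num_lmers(l):,} candidate strings, "
            f"{neighborhood_size(l, d):,} variants per candidate"
        )
```

Both counts grow exponentially with l, which is why motifs longer than about 20 nucleotides with many allowed mismatches are hard for methods that enumerate candidates exhaustively.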
Pages: 14