Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features

Cited by: 337
Authors
Ghandi, Mahmoud [1]
Lee, Dongwon [1]
Mohammad-Noori, Morteza [2,3]
Beer, Michael A. [1,4]
Affiliations
[1] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD USA
[2] Univ Tehran, Sch Math Stat & Comp Sci, Tehran, Iran
[3] Inst Res Fundamental Sci IPM, Sch Comp Sci, Tehran, Iran
[4] Johns Hopkins Univ, McKusick Nathans Inst Genet Med, Baltimore, MD USA
Keywords
STRING KERNELS; BINDING SITES; CHIP-SEQ; TRANSCRIPTION; GENE; CTCF
DOI
10.1371/journal.pcbi.1003711
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a naive Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
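As an illustration of the gapped k-mer idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a gapped k-mer is a word of total length `l` in which `k` positions are informative and the remaining `l - k` positions are wildcards. The function names `gapped_kmer_counts` and `gkm_kernel`, the `-` gap symbol, and the default parameter values are hypothetical choices made for this example. The kernel here is a naive, unnormalized dot product of count vectors; the abstract's tree data structure for computing the kernel matrix efficiently is not reproduced.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l=4, k=2, gap="-"):
    """Count gapped k-mers in seq (illustrative, brute-force version).

    Assumption: each window of length l contributes one feature per choice
    of k informative positions; the other l - k positions become `gap`.
    """
    counts = Counter()
    position_sets = [set(p) for p in combinations(range(l), k)]
    # Slide a window of length l over the sequence.
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for kept in position_sets:
            # Keep the informative positions, mask the rest.
            pattern = "".join(
                c if i in kept else gap for i, c in enumerate(window)
            )
            counts[pattern] += 1
    return counts


def gkm_kernel(a, b, l=4, k=2):
    """Unnormalized dot product of the two gapped k-mer count vectors."""
    ca = gapped_kmer_counts(a, l, k)
    cb = gapped_kmer_counts(b, l, k)
    # Counter returns 0 for patterns missing from cb.
    return sum(v * cb[p] for p, v in ca.items())


if __name__ == "__main__":
    print(sorted(gapped_kmer_counts("ACGT", l=3, k=2).items()))
    print(gkm_kernel("ACGTACGT", "ACGTTCGT"))
```

Because each window is expanded into several masked patterns, two sequences that differ at a single position still share many features, which is the intuition behind the robustness to sparse counts that the abstract describes for large `k`.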
Pages: 15