Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features

Cited by: 337

Authors:

Ghandi, Mahmoud ^{[1]}

Lee, Dongwon ^{[1]}

Mohammad-Noori, Morteza ^{[2,3]}

Beer, Michael A. ^{[1,4]}

Affiliations:

[1] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD USA

[2] Univ Tehran, Sch Math Stat & Comp Sci, Tehran, Iran

[3] Inst Res Fundamental Sci IPM, Sch Comp Sci, Tehran, Iran

[4] Johns Hopkins Univ, McKusick Nathans Inst Genet Med, Baltimore, MD USA

Source:

PLOS COMPUTATIONAL BIOLOGY | 2014 / Volume 10 / Issue 07

Keywords:

STRING KERNELS; BINDING SITES; CHIP-SEQ; TRANSCRIPTION; GENE; CTCF

DOI:

10.1371/journal.pcbi.1003711

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Subject classification codes:

071010; 081704

Abstract:

Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naive-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.

Pages: 15
