DNA sequence classification via an expectation maximization algorithm and neural networks: A case study

Cited by: 33
Authors
Ma, QC [2]
Wang, JTL
Shasha, D
Wu, CH
Affiliations
[1] New Jersey Inst Technol, Dept Comp Sci, Newark, NJ 07102 USA
[2] Novartis Pharmaceut Corp, Summit, NJ 07901 USA
[3] NYU, Courant Inst Math Sci, New York, NY 10012 USA
[4] Georgetown Univ, Med Ctr, Natl Biomed Res Fdn, Washington, DC 20007 USA
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS | 2001 / Vol. 31 / No. 4
Funding
National Science Foundation (USA);
Keywords
Bayesian inference; bioinformatics; data mining; expectation maximization (EM); neural networks (NNs); promoter recognition;
DOI
10.1109/5326.983930
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents new techniques for biosequence classification, with a focus on recognizing E. coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, we want to determine whether or not S is an E. coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. coli promoter sequence. Previously published EM algorithms assume a uniform distribution for the lengths of the two spacers: the one between the -35 and -10 binding sites, and the one between the -10 binding site and the transcriptional start site. Our algorithm instead deduces the probability distribution of these lengths. Based on the located binding sites, we select features in each E. coli promoter sequence according to their information content and represent the features using an orthogonal encoding method. We then feed the features to a neural network (NN) for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different datasets.
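The abstract does not spell out the feature pipeline, so the following is only a minimal sketch of two of its steps under stated assumptions: "orthogonal encoding" is taken to mean the common one-hot scheme (four 0/1 values per base), and "information content" is taken to be 2 bits minus the Shannon entropy of each aligned position against a uniform background. The function names `orthogonal_encode`, `information_content`, and `select_positions` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

BASES = "ACGT"  # assumed alphabet; ambiguous bases are rejected


def orthogonal_encode(seq):
    """One-hot encode a DNA string: each base becomes four 0/1 values."""
    vec = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def information_content(aligned_sites):
    """Per-position information content in bits for equal-length sites.

    Assumes a uniform background, so each position carries at most 2 bits.
    """
    n = len(aligned_sites)
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for i in range(length):
        counts = Counter(site[i].upper() for site in aligned_sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(2.0 - entropy)
    return bits


def select_positions(bits, k):
    """Return the indices of the k most informative positions, in order."""
    ranked = sorted(range(len(bits)), key=lambda i: bits[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy aligned sites for illustration only; not data from the paper.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    keep = select_positions(information_content(sites), 4)
    features = orthogonal_encode("".join(sites[0][i] for i in keep))
    print(keep, features)
```

In a sketch like this, the selected positions would be encoded for every sequence and the resulting fixed-length 0/1 vectors passed to a classifier. How the paper actually ranks features and configures its network is not stated in the abstract.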
Pages: 468-475
Number of pages: 8