Effects of choice of DNA sequence model structure on gene identification accuracy

Cited by: 7
Authors
Azad, RK
Borodovsky, M [1]
Affiliations
[1] Georgia Inst Technol, Sch Biol, Atlanta, GA 30332 USA
[2] Georgia Inst Technol, Sch Biomed Engn, Atlanta, GA 30332 USA
Funding
US National Institutes of Health
DOI
10.1093/bioinformatics/bth028
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Markov chain models of DNA sequences have frequently been used in gene finding algorithms. Performance of the algorithm critically depends on the model structure and parameters. Still, the issue of choosing the model structure has not been studied with sufficient attention.
Results: We have assessed performance of several types of Markov chain models, both fixed order (FO) models and models with interpolation, within the framework of the GeneMark algorithm. The performance was measured in two ways: (i) the accuracy of detection of protein-coding potential in artificial DNA sequences and (ii) the accuracy of identifying genes in real prokaryotic genomes. We observed that the models built by deleted interpolation (DI) slightly outperformed other models in detecting protein-coding potential in artificial DNA sequences with GC content in the medium range and also in detecting genes in real genomes with medium GC content. For artificial and real genomic DNA with high or low GC content, we observed that the models built by DI were in some cases slightly outperformed by FO models.
Pages: 993-1005
Page count: 13
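Illustrative sketch
The abstract contrasts fixed order (FO) Markov chain models with models built by interpolation. The Python sketch below is one hedged illustration of that distinction and is not the GeneMark implementation or the procedure used in the paper. It assumes add-one pseudocounts for the FO estimates and a single held-out sequence for fitting the mixing weights by expectation-maximization, which simplifies deleted interpolation (the full method rotates the held-out block through the training data). All function names (`count_kmers`, `fo_prob`, `interp_prob`, `fit_lambdas`) and the toy sequences are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seq, max_order):
    """Count every k-mer of length 1..max_order+1 in seq."""
    counts = Counter()
    for k in range(1, max_order + 2):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def fo_prob(counts, context, nt, pseudo=1.0):
    """Fixed-order estimate of P(nt | context); the pseudocount is an assumption."""
    num = counts[context + nt] + pseudo
    den = sum(counts[context + b] for b in ALPHABET) + pseudo * len(ALPHABET)
    return num / den


def interp_prob(counts, lambdas, context, nt):
    """Mix the FO estimates of orders 0..len(lambdas)-1 with weights lambdas."""
    total = 0.0
    for order, lam in enumerate(lambdas):
        # Use only the last `order` symbols of the context for this component.
        total += lam * fo_prob(counts, context[len(context) - order:], nt)
    return total


def fit_lambdas(counts, heldout, max_order, n_iter=20):
    """Fit mixing weights by EM on held-out data (simplified: one held-out block)."""
    n = max_order + 1
    lambdas = [1.0 / n] * n
    for _ in range(n_iter):
        resp = [0.0] * n
        for i in range(max_order, len(heldout)):
            context, nt = heldout[i - max_order:i], heldout[i]
            parts = [lam * fo_prob(counts, context[max_order - o:], nt)
                     for o, lam in enumerate(lambdas)]
            z = sum(parts)  # positive because of the pseudocounts
            for o in range(n):
                resp[o] += parts[o] / z
        total = sum(resp)
        lambdas = [r / total for r in resp]
    return lambdas


if __name__ == "__main__":
    train = "ATGGCGAAAGCGTTAGCGGCGATGAAAGCG" * 5   # toy training sequence
    held = "ATGGCGGCGAAATTAGCGATG" * 3             # toy held-out sequence
    counts = count_kmers(train, max_order=2)
    lambdas = fit_lambdas(counts, held, max_order=2)
    print("weights for orders 0, 1, 2:", lambdas)
    print("P(G | GC) interpolated:", interp_prob(counts, lambdas, "GC", "G"))
```

Setting the weight vector to all zeros except a single 1 recovers a pure FO model of that order, so both model families in the abstract can be compared through the same `interp_prob` call.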