Microbial gene identification using interpolated Markov models

Cited by: 726
Authors
Salzberg, SL
Delcher, AL
Kasif, S
White, O
Affiliations
[1] Inst Genom Res, Rockville, MD 20850 USA
[2] Johns Hopkins Univ, Dept Comp Sci, Baltimore, MD 21218 USA
[3] Loyola Coll, Dept Comp Sci, Baltimore, MD 21210 USA
[4] Univ Illinois, Dept Elect Engn & Comp Sci, Chicago, IL 60607 USA
DOI
10.1093/nar/26.2.544
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
This paper describes a new system, GLIMMER, for finding genes in microbial genomes. In a series of tests on Haemophilus influenzae, Helicobacter pylori and other complete microbial genomes, this system has proven to be very accurate at locating virtually all the genes in these sequences, outperforming previous methods. A conservative estimate based on experiments on H. pylori and H. influenzae is that the system finds >97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context, i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence. As a result, GLIMMER is more flexible and more powerful than fixed-order Markov methods, which have previously been the primary content-based technique for finding genes in microbial DNA.
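The variable-context idea in the abstract can be illustrated with a minimal Python sketch. This is not GLIMMER's implementation: the class name ToyIMM, the parameters max_order and threshold, and the simple count-based weighting rule are all assumptions made for this example, since the abstract does not specify how the interpolation weights are chosen. The sketch only shows the general principle that longer contexts are trusted more when they have been observed often, and that the prediction otherwise leans on shorter contexts.

```python
from collections import defaultdict


class ToyIMM:
    """Toy interpolated Markov model over A/C/G/T (illustrative only)."""

    def __init__(self, max_order=3, threshold=20):
        self.max_order = max_order
        # Hypothetical rule: a context seen `threshold` times is fully trusted.
        self.threshold = threshold
        # counts[k][context][base] = times `base` followed a length-k context
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        """Count next-base occurrences for every context length 0..max_order."""
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """Interpolated probability of `base` given the preceding `context`."""
        p = 0.25  # uniform fallback over the four nucleotides
        for k in range(min(len(context), self.max_order) + 1):
            ctx = context[len(context) - k:]  # last k bases of the context
            seen = self.counts[k].get(ctx)
            if not seen:
                break  # longer contexts were never observed either
            total = sum(seen.values())
            lam = min(1.0, total / self.threshold)  # assumed weighting rule
            # Blend this order's estimate with the lower-order estimate.
            p = lam * seen.get(base, 0) / total + (1.0 - lam) * p
        return p


if __name__ == "__main__":
    imm = ToyIMM(max_order=2, threshold=5)
    imm.train(["ACGACGACGACGACGACGACG"])
    for b in "ACGT":
        print(b, imm.prob("AC", b))
```

A fixed-order Markov model would use only one value of k for every position. In this sketch, a rarely seen long context contributes little, so the effective context length varies with the local data, which is the flexibility the abstract attributes to IMMs.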
Pages: 544-548
Page count: 5