On the convergence of a clustering algorithm for protein-coding regions in microbial genomes

被引:15
作者
Baldi, P [1 ]
机构
[1] Univ Calif Irvine, Coll Med, Dept Biol Chem, Irvine, CA 92717 USA
关键词
D O I
10.1093/bioinformatics/16.4.367
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: As the number of fully sequenced prokaryotic genomes continues to grow rapidly, computational methods for reliably detecting protein-coding regions become even more important. Audic and Claverie (1998) Proc. Natl Acad. Sci. USA, 95 10026-10031, have proposed a clustering algorithm for protein-coding regions in microbial genomes. The algorithm is based on three Markov models of order k associated with subsequences extracted front a given genome. The parameters of the three Markov models are recursively updated by the algorithm which, in simulations, always appear to converge to a unique stable partition of the genome. The partition corresponds to three kinds of regions: (1) coding on the direct strand, (2) coding on the complementary strand, (3) non-coding. Results: Here we provide an explanation for the convergence of the algorithm by observing that it is essentially a form of the expectation maximization (EM) algorithm applied to the corresponding mixture model. We also provide a partial justification for the uniqueness of the partition based on identifiability. Other possible variations and improvements are briefly discussed.
引用
收藏
页码:367 / 371
页数:5
相关论文
共 17 条
  • [1] Self-identification of protein-coding regions in microbial genomes
    Audic, S
    Claverie, JM
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1998, 95 (17) : 10026 - 10031
  • [2] BALDI P, 1994, NEURAL COMPUT, V6, P305
  • [3] Baldi P., 1998, Bioinformatics: The machine learning approach
  • [4] DETECTION OF NEW GENES IN A BACTERIAL GENOME USING MARKOV-MODELS FOR 3 GENE CLASSES
    BORODOVSKY, M
    MCININCH, JD
    KOONIN, EV
    RUDD, KE
    MEDIGUE, C
    DANCHIN, A
    [J]. NUCLEIC ACIDS RESEARCH, 1995, 23 (17) : 3554 - 3562
  • [5] DERIVING NONHOMOGENEOUS DNA MARKOV-CHAIN MODELS BY CLUSTER-ANALYSIS ALGORITHM MINIMIZING MULTIPLE ALIGNMENT ENTROPY
    BORODOVSKY, M
    PERESETSKY, A
    [J]. COMPUTERS & CHEMISTRY, 1994, 18 (03): : 259 - 267
  • [6] GENMARK - PARALLEL GENE RECOGNITION FOR BOTH DNA STRANDS
    BORODOVSKY, M
    MCININCH, J
    [J]. COMPUTERS & CHEMISTRY, 1993, 17 (02): : 123 - 133
  • [7] Prediction of complete gene structures in human genomic DNA
    Burge, C
    Karlin, S
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1997, 268 (01) : 78 - 94
  • [8] MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM
    DEMPSTER, AP
    LAIRD, NM
    RUBIN, DB
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01): : 1 - 38
  • [9] Durbin R., 1998, BIOL SEQUENCE ANAL P
  • [10] Everitt B. S., 1981, FINITE MIXTURE DISTR