On the convergence of a clustering algorithm for protein-coding regions in microbial genomes

被引：15

作者：

Baldi, P ^{[1
]}

机构：

[1] Univ Calif Irvine, Coll Med, Dept Biol Chem, Irvine, CA 92717 USA

来源：

BIOINFORMATICS | 2000年 / 16卷 / 04期

关键词：

D O I：

10.1093/bioinformatics/16.4.367

中图分类号：

Q5 [生物化学];

学科分类号：

071010 ; 081704 ;

摘要：

Motivation: As the number of fully sequenced prokaryotic genomes continues to grow rapidly, computational methods for reliably detecting protein-coding regions become even more important. Audic and Claverie (1998) Proc. Natl Acad. Sci. USA, 95 10026-10031, have proposed a clustering algorithm for protein-coding regions in microbial genomes. The algorithm is based on three Markov models of order k associated with subsequences extracted front a given genome. The parameters of the three Markov models are recursively updated by the algorithm which, in simulations, always appear to converge to a unique stable partition of the genome. The partition corresponds to three kinds of regions: (1) coding on the direct strand, (2) coding on the complementary strand, (3) non-coding. Results: Here we provide an explanation for the convergence of the algorithm by observing that it is essentially a form of the expectation maximization (EM) algorithm applied to the corresponding mixture model. We also provide a partial justification for the uniqueness of the partition based on identifiability. Other possible variations and improvements are briefly discussed.

引用

页码：367 / 371

页数：5

共 17 条

[1] Self-identification of protein-coding regions in microbial genomes
Audic, S
Claverie, JM
[J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1998, 95 (17) : 10026 - 10031
[2] BALDI P, 1994, NEURAL COMPUT, V6, P305
[3] Baldi P., 1998, Bioinformatics: The machine learning approach
[4] DETECTION OF NEW GENES IN A BACTERIAL GENOME USING MARKOV-MODELS FOR 3 GENE CLASSES
BORODOVSKY, M
MCININCH, JD
KOONIN, EV
RUDD, KE
MEDIGUE, C
DANCHIN, A
[J]. NUCLEIC ACIDS RESEARCH, 1995, 23 (17) : 3554 - 3562
[5] DERIVING NONHOMOGENEOUS DNA MARKOV-CHAIN MODELS BY CLUSTER-ANALYSIS ALGORITHM MINIMIZING MULTIPLE ALIGNMENT ENTROPY
BORODOVSKY, M
PERESETSKY, A
[J]. COMPUTERS & CHEMISTRY, 1994, 18 (03): : 259 - 267
[6] GENMARK - PARALLEL GENE RECOGNITION FOR BOTH DNA STRANDS
BORODOVSKY, M
MCININCH, J
[J]. COMPUTERS & CHEMISTRY, 1993, 17 (02): : 123 - 133
[7] Prediction of complete gene structures in human genomic DNA
Burge, C
Karlin, S
[J]. JOURNAL OF MOLECULAR BIOLOGY, 1997, 268 (01) : 78 - 94
[8] MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM
DEMPSTER, AP
LAIRD, NM
RUBIN, DB
[J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01): : 1 - 38
[9] Durbin R., 1998, BIOL SEQUENCE ANAL P
[10] Everitt B. S., 1981, FINITE MIXTURE DISTR

← 1 2 →