On the total number of genes and their length distribution in complete microbial genomes

Cited by: 145
Authors
Skovgaard, M [1]
Jensen, LJ [1]
Brunak, S [1]
Ussery, D [1]
Krogh, A [1]
Affiliations
[1] Tech Univ Denmark, Bioctr, Ctr Biol Sequence Anal, DK-2800 Lyngby, Denmark
Keywords
DOI
10.1016/S0168-9525(01)02372-1
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
In sequenced microbial genomes, some of the annotated genes are actually not protein-coding genes, but rather open reading frames that occur by chance. Therefore, the number of annotated genes is higher than the actual number of genes for most of these microbes. Comparison of the length distribution of the annotated genes with the length distribution of those matching a known protein reveals that too many short genes are annotated in many genomes. Here we estimate the true number of protein-coding genes for sequenced genomes. Although it is often claimed that Escherichia coli has about 4300 genes, we show that it probably has only approximately 3800 genes, and that a similar discrepancy exists for almost all published genomes.
Pages: 425-428
Page count: 4
Related papers
11 in total
  • [1] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [2] The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
    Bairoch, A
    Apweiler, R
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 45 - 48
  • [3] Structural and genomic correlates of hyperthermostability
    Cambillau, C
    Claverie, JM
    [J]. JOURNAL OF BIOLOGICAL CHEMISTRY, 2000, 275 (42) : 32383 - 32386
  • [4] Biology's new Rosetta stone
    Das, S
    Yu, LH
    Gaitatzes, C
    Rogers, R
    Freeman, J
    Bienkowska, J
    Adams, RM
    Smith, TF
    Lindellen, J
    [J]. NATURE, 1997, 385 (6611) : 29 - 30
  • [5] HOBOHM U, 1992, PROTEIN SCI, V1, P409
  • [6] The 1999 SWISS-2DPAGE database update
    Hoogland, C
    Sanchez, JC
    Tonella, L
    Binz, PA
    Bairoch, A
    Hochstrasser, DF
    Appel, RD
    [J]. NUCLEIC ACIDS RESEARCH, 2000, 28 (01) : 286 - 288
  • [7] Kawarabayasi Y, 1999, DNA Res, V6, P83, DOI 10.1093/dnares/6.2.83
  • [8] NATALE DA, 2000, GENOME BIOL, P1
  • [9] Silverman B.W., 1986, Monographs on Statistics and Applied Probability, DOI [10.1201/9781315140919, 10.2307/2347507]
  • [10] The COG database: new developments in phylogenetic classification of proteins from complete genomes
    Tatusov, RL
    Natale, DA
    Garkavtsev, IV
    Tatusova, TA
    Shankavaram, UT
    Rao, BS
    Kiryutin, B
    Galperin, MY
    Fedorova, ND
    Koonin, EV
    [J]. NUCLEIC ACIDS RESEARCH, 2001, 29 (01) : 22 - 28