Large-scale prokaryotic gene prediction and comparison to genome annotation

Times cited: 108
Authors
Nielsen, P [1]
Krogh, A [1]
Affiliations
[1] Univ Copenhagen, Inst Mol Biol & Physiol, Bioinformat Ctr, DK-2100 Copenhagen, Denmark
DOI
10.1093/bioinformatics/bti701
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene.
Results: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to ~60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation.
Availability: The EasyGene 1.2 predictions and statistics can be accessed at http://www.binf.ku.dk/cgi-bin/easygene/search
Contact: pern@binf.ku.dk
Pages: 4322-4329
Page count: 8