MetaGene: prokaryotic gene finding from environmental genome shotgun sequences

Cited by: 737
Authors
Noguchi, Hideki [1]
Park, Jungho [1]
Takagi, Toshihisa [1]
Affiliations
[1] Univ Tokyo, Grad Sch Frontier Sci, Dept Computat Biol, Kashiwa, Chiba 2778562, Japan
DOI
10.1093/nar/gkl723
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Exhaustive gene identification is a fundamental goal in all metagenomics projects. However, most metagenomic sequences are unassembled anonymous fragments, to which conventional gene-finding methods cannot be applied. We have developed a prokaryotic gene-finding program, MetaGene, which uses di-codon frequencies estimated from the GC content of a given sequence, together with various other measures. MetaGene can predict a whole range of prokaryotic genes from anonymous genomic sequences of a few hundred bases, with a sensitivity of 95% and a specificity of 90% for artificial shotgun sequences (700 bp fragments from 12 species). MetaGene has two sets of codon frequency interpolations, one for bacteria and one for archaea, and automatically selects the proper set for a given sequence using the domain classification method we propose. The domain classification works properly, correctly assigning domain information to more than 90% of the artificial shotgun sequences. Applied to the Sargasso Sea dataset, MetaGene predicted almost all of the annotated genes and a notable number of novel genes. MetaGene can be applied to a wide variety of metagenomic projects and expands the utility of metagenomics.
Pages: 5623-5630
Page count: 8