mGene: Accurate SVM-based gene finding with an application to nematode genomes

Cited by: 55
Authors
Schweikert, Gabriele [1 ,2 ,3 ]
Zien, Alexander [1 ,4 ]
Zeller, Georg [1 ,3 ]
Behr, Jonas [1 ]
Dieterich, Christoph [3 ]
Ong, Cheng Soon [1 ,2 ]
Philips, Petra [1 ]
De Bona, Fabio [1 ]
Hartmann, Lisa [1 ]
Bohlen, Anja [1 ]
Krueger, Nina [1 ]
Sonnenburg, Soeren [1 ,4 ]
Raetsch, Gunnar [1 ]
Affiliations
[1] Max Planck Gesell, Friedrich Miescher Lab, D-72076 Tubingen, Germany
[2] Max Planck Inst Biol Cybernet, D-72076 Tubingen, Germany
[3] Max Planck Inst Dev Biol, D-72076 Tubingen, Germany
[4] Fraunhofer Inst FIRST IDA, D-12489 Berlin, Germany
Keywords
RECOGNITION; PREDICTION; ORTHOLOGS; ENCODE; EXONS;
DOI
10.1101/gr.090597.108
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
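The abstract ranks gene finders by the average of sensitivity and specificity but does not define either term. The sketch below assumes the convention common in gene-finding evaluations, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), the fraction of predictions that are correct. The function names and the example counts are hypothetical and not taken from the paper; only the 24-of-57 figure comes from the abstract.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Assumed gene-finding 'specificity': TP / (TP + FP)."""
    return tp / (tp + fp)


def mean_sn_sp(tp: int, fp: int, fn: int) -> float:
    """Average of sensitivity and specificity, the summary score named in the abstract."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tp, fp))


# Hypothetical exon-level counts, for illustration only (not from the paper).
score = mean_sn_sp(tp=80, fp=20, fn=10)

# Arithmetic stated in the abstract: 24 of 57 tested novel predictions confirmed.
confirmed_fraction = 24 / 57
print(f"mean(Sn, Sp) = {score:.3f}, confirmed = {confirmed_fraction:.0%}")
```

The same three counts can be collected separately at nucleotide, exon, and transcript level, which would give one such score per level.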
Pages: 2133-2143
Number of pages: 11