EGASP: the human ENCODE genome annotation assessment project

Cited by: 84
Authors
Guigo, Roderic [1]
Flicek, Paul
Abril, Josep F.
Reymond, Alexandre
Lagarde, Julien
Denoeud, France
Antonarakis, Stylianos
Ashburner, Michael
Bajic, Vladimir B.
Birney, Ewan
Castelo, Robert
Eyras, Eduardo
Ucla, Catherine
Gingeras, Thomas R.
Harrow, Jennifer
Hubbard, Tim
Lewis, Suzanna E.
Reese, Martin G.
Affiliations
[1] Univ Pompeu Fabra, Ctr Reg Genom, Inst Municipal Invest Med, E-08003 Barcelona, Spain
[2] European Bioinformat Inst, Cambridge CB10 1SD, England
[3] Univ Lausanne, Ctr Integrat Genom, Lausanne, Switzerland
[4] Univ Geneva, Sch Med, Univ Hosp Geneva, CH-1211 Geneva, Switzerland
[5] Univ Cambridge, Dept Genet, Cambridge CB2 3EH, England
[6] Univ Western Cape, S African Natl Bioinformat Inst, ZA-7535 Bellville, South Africa
[7] Affymetrix Inc, Santa Clara, CA 95051 USA
[8] Wellcome Trust Sanger Inst, Cambridge CB10 1SA, England
[9] Univ Calif Berkeley, Dept Mol & Cellular Biol, Berkeley, CA 94792 USA
[10] Omicia Inc, Emeryville, CA 94608 USA
Keywords
DOI
10.1186/gb-2006-7-s1-s2
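Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification codes
071005 [Microbiology]; 0836 [Bioengineering]; 090102 [Crop Genetics and Breeding]; 100705 [Microbial and Biochemical Pharmacy];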
Abstract
Background: We present the results of EGASP, a community experiment to assess the state of the art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
Results: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50%. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
Conclusions: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
Pages: 31
Related papers
73 in total
[1]
SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model [J].
Alexandersson, M ;
Cawley, S ;
Pachter, L .
GENOME RESEARCH, 2003, 13 (03) :496-502
[2]
ALLEN JE, 2006, GENOME BIOL S1, V7, P9
[3]
[Anonymous], 2009, BIOINFORMATICS, V26, P139, DOI [10.1093/bioinformatics/btp616, DOI 10.1093/BIOINFORMATICS/BTP616, DOI 10.1093/BIOINFORMATICS/17.SUPPL_1.S140]
[4]
Pairagon plus N-SCAN_EST: a model-based gene annotation pipeline [J].
Arumugam, Manimozhiyan ;
Wei, Chaochun ;
Brown, Randall H. ;
Brent, Michael R. .
GENOME BIOLOGY, 2006, 7 (Suppl 1)
[5]
Ashburner M, 1999, GENETICS, V153, P179
[6]
Bajic V B, 2000, Brief Bioinform, V1, P214, DOI 10.1093/bib/1.3.214
[7]
BAJIC VB, 2006, GENOME BIOL S1, V7, P3
[8]
Assessing the accuracy of prediction algorithms for classification: an overview [J].
Baldi, P ;
Brunak, S ;
Chauvin, Y ;
Andersen, CAF ;
Nielsen, H .
BIOINFORMATICS, 2000, 16 (05) :412-424
[9]
MaskerAid: a performance enhancement to RepeatMasker [J].
Bedell, JA ;
Korf, I ;
Gish, W .
BIOINFORMATICS, 2000, 16 (11) :1040-1041
[10]
GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses [J].
Besemer, J ;
Borodovsky, M .
NUCLEIC ACIDS RESEARCH, 2005, 33 :W451-W454