Genome annotation assessment in Drosophila melanogaster

Cited by: 125
Authors
Reese, MG
Hartzell, G
Harris, NL
Ohler, U
Abril, JF
Lewis, SE
Affiliations
[1] Univ Calif Berkeley, Dept Mol & Cell Biol, Berkeley Drosophila Genome Project, Berkeley, CA 94720 USA
[2] Univ Erlangen Nurnberg, Chair Pattern Recognit, D-91058 Erlangen, Germany
[3] Univ Pompeu Fabra, Dept Med Informat, IMII, Barcelona 08003, Spain
DOI
10.1101/gr.10.4.483
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
Pages: 483 - 501
Page count: 19
Related papers
54 in total
  • [11] BIRNEY E, 1999, WISE2
  • [12] Prediction of complete gene structures in human genomic DNA
    Burge, C
    Karlin, S
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1997, 268 (01) : 78 - 94
  • [13] Finding the genes in genomic DNA
    Burge, CB
    Karlin, S
    [J]. CURRENT OPINION IN STRUCTURAL BIOLOGY, 1998, 8 (03) : 346 - 354
  • [14] Evaluation of gene structure prediction programs
    Burset, M
    Guigo, R
    [J]. GENOMICS, 1996, 34 (03) : 353 - 367
  • [15] Meeting review: The Second Meeting on the Critical Assessment of Techniques for Protein Structure Prediction (CASP2), Asilomar, California, December 13-16, 1996
    Dunbrack, RL
    Gerloff, DL
    Bower, M
    Chen, XW
    Lichtarge, O
    Cohen, FE
    [J]. FOLDING & DESIGN, 1997, 2 (02) : R27 - R42
  • [16] Eeckman FH, 1995, METHOD CELL BIOL, V48, P583
  • [17] Eukaryotic promoter recognition
    Fickett, JW
    Hatzigeorgiou, AC
    [J]. GENOME RESEARCH, 1997, 7 (09) : 861 - 878
  • [18] Assessment of protein coding measures
    Fickett, JW
    Tung, CS
    [J]. NUCLEIC ACIDS RESEARCH, 1992, 20 (24) : 6441 - 6450
  • [19] A computer program for aligning a cDNA sequence with a genomic DNA sequence
    Florea, L
    Hartzell, G
    Zhang, Z
    Rubin, GM
    Miller, W
    [J]. GENOME RESEARCH, 1998, 8 (09) : 967 - 974
  • [20] MAGPIE: Automated genome interpretation
    Gaasterland, T
    Sensen, CW
    [J]. TRENDS IN GENETICS, 1996, 12 (02) : 76 - 78