Using ESTs to improve the accuracy of de novo gene prediction

Cited by: 26
Authors
Wei, Chaochun
Brent, Michael R.
Affiliations
[1] Washington Univ, Lab Computat Genom, St Louis, MO 63130 USA
[2] Washington Univ, Dept Comp Engn & Sci, St Louis, MO 63130 USA
DOI
10.1186/1471-2105-7-327
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: ESTs are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction.
Results: TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome, TWINSCAN_EST shows a 14% improvement in sensitivity and a 13% improvement in specificity in predicting exact gene structures compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these also constrain the predictions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and a 1% improvement in specificity of exact gene structure predictions compared to N-SCAN.
Conclusion: TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available.
Pages: 10