IDENTIFICATION OF PROTEIN-CODING REGIONS IN GENOMIC DNA

Cited by: 127
Authors
SNYDER, EE [1]
STORMO, GD [1]
Affiliations
[1] UNIV COLORADO, DEPT MOLEC CELLULAR & DEV BIOL, BOULDER, CO 80309 USA
Keywords
GENE IDENTIFICATION; EXON STRUCTURE; CODING SEQUENCE; ARTIFICIAL INTELLIGENCE; DYNAMIC PROGRAMMING
DOI
10.1006/jmbi.1995.0198
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
We have developed a computer program, GeneParser, which identifies and determines the fine structure of protein genes in genomic DNA sequences. The program scores all subintervals in a sequence for content statistics indicative of introns and exons, and for sites that identify their boundaries. This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last). A dynamic programming algorithm is then applied to this data to find the combination of introns and exons that maximizes the likelihood function. Using this method, we can rapidly generate ranked suboptimal solutions, each of which is the optimum solution containing a given intron-exon junction. We have tested the system on a large collection of human genes. On sequences not used in training, we achieved a correlation coefficient for exon nucleotide prediction of 0.89. For a subset of G + C-rich genes, a correlation coefficient of 0.94 was achieved. We have also quantified the robustness of the method to substitution and frame-shift errors and show how the system can be optimized for performance on sequences with known levels of sequencing errors.
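The dynamic programming step described in the abstract can be illustrated with a minimal sketch. This is not the GeneParser implementation: it assumes only two segment kinds (exon and intron, without the first/internal/last distinction), requires that segments alternate and cover the whole sequence, and takes the per-interval scores from a caller-supplied function in place of the neural-network-weighted log-likelihoods. The names `best_parse`, `toy_score` and `labels` are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int, int]  # (kind, start, end) over the half-open interval [start, end)


def best_parse(n: int, score: Callable[[str, int, int], float]) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring alternating exon/intron parse of positions [0, n)."""
    neg_inf = float("-inf")
    kinds = ("exon", "intron")
    other = {"exon": "intron", "intron": "exon"}
    # best[k][j]: best total score of a parse of [0, j) whose last segment has kind k
    best = {k: [neg_inf] * (n + 1) for k in kinds}
    back = {k: [None] * (n + 1) for k in kinds}

    for j in range(1, n + 1):
        for k in kinds:
            for i in range(j):
                # A segment starting at 0 opens the parse; otherwise it must
                # follow a segment of the other kind that ends at i.
                prev = 0.0 if i == 0 else best[other[k]][i]
                if prev == neg_inf:
                    continue
                total = prev + score(k, i, j)
                if total > best[k][j]:
                    best[k][j] = total
                    back[k][j] = i

    # Trace back from whichever kind gives the better full-length parse.
    k = max(kinds, key=lambda kind: best[kind][n])
    total = best[k][n]
    segments: List[Segment] = []
    j = n
    while j > 0 and back[k][j] is not None:
        i = back[k][j]
        segments.append((k, i, j))
        j, k = i, other[k]
    segments.reverse()
    return total, segments


# Toy scores (an assumption for this sketch): +1 for each position whose
# hidden label matches the proposed kind, -1 for each mismatch.
labels = "EEEIIIIEE"


def toy_score(kind: str, i: int, j: int) -> float:
    want = "E" if kind == "exon" else "I"
    return float(sum(1 if c == want else -1 for c in labels[i:j]))


if __name__ == "__main__":
    print(best_parse(len(labels), toy_score))
```

The table is filled with O(n^2) interval evaluations per segment kind. Ranked suboptimal solutions of the kind the abstract mentions would need additional bookkeeping (for example, a matching backward pass), which this sketch omits.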
Pages: 1-18
Number of pages: 18