Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks

Cited by: 113
Authors
SNYDER, EE
STORMO, GD
Affiliation
[1] Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder
DOI
10.1093/nar/21.3.607
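Chinese Library Classification (CLC)
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry];
Abstract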
Dynamic programming (DP) is applied to the problem of precisely identifying internal exons and introns in genomic DNA sequences. The program GeneParser first scores the sequence of interest for splice sites and for these intron- and exon-specific content measures: codon usage, local compositional complexity, 6-tuple frequency, length distribution and periodic asymmetry. This information is then organized for interpretation by DP. GeneParser employs the DP algorithm to enforce the constraints that introns and exons must be adjacent and non-overlapping, and finds the highest-scoring combination of introns and exons subject to these constraints. Weights for the various classification procedures are determined by training a simple feed-forward neural network to maximize the number of correct predictions. In a pilot study, the system has been trained on a set of 56 human gene fragments containing 150 internal exons in a total of 158,691 bp of genomic sequence. When tested against the training data, GeneParser precisely identifies 75% of the exons and correctly predicts 86% of coding nucleotides as coding, while only 13% of non-exon bp were predicted to be coding. This corresponds to a correlation coefficient for exon prediction of 0.85. Because of the simplicity of the network weighting scheme, generalization performance is nearly as good as with the training set.
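The DP step described in the abstract can be illustrated with a minimal sketch. This is not GeneParser's code: the function `best_parse`, the callback `seg_score`, and the toy per-base scores are hypothetical names and values chosen for illustration, and the splice-site scores and neural-network weighting are not modeled. The sketch only shows how DP can find the highest-scoring split of a sequence into adjacent, non-overlapping, alternating intron and exon segments.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int, int]  # (label, start, end) with end exclusive


def best_parse(n: int, seg_score: Callable[[str, int, int], float]) -> Tuple[float, List[Segment]]:
    """Highest-scoring split of positions [0, n) into alternating segments.

    seg_score(label, i, j) is an assumed, user-supplied score for calling
    positions [i, j) a single segment of type `label`.
    """
    labels = ("exon", "intron")
    other = {"exon": "intron", "intron": "exon"}
    # best[j][s]: best score of a parse of [0, j) whose last segment has label s
    best = [{s: float("-inf") for s in labels} for _ in range(n + 1)]
    back = [{s: 0 for s in labels} for _ in range(n + 1)]

    for j in range(1, n + 1):
        for s in labels:
            for i in range(j):
                # A segment starting at 0 has no predecessor; otherwise the
                # previous segment must carry the opposite label.
                prev = 0.0 if i == 0 else best[i][other[s]]
                cand = prev + seg_score(s, i, j)
                if cand > best[j][s]:
                    best[j][s] = cand
                    back[j][s] = i

    # Trace back from the better of the two possible final labels.
    s = max(labels, key=lambda lab: best[n][lab])
    total = best[n][s]
    segments: List[Segment] = []
    j = n
    while j > 0:
        i = back[j][s]
        segments.append((s, i, j))
        j, s = i, other[s]
    segments.reverse()
    return total, segments


# Toy input (made up): positive values look exon-like, negative intron-like.
per_base = [-1, -1, 2, 2, 2, -1, -1]


def toy_score(label: str, i: int, j: int) -> float:
    total = sum(per_base[i:j])
    return total if label == "exon" else -total


score, segments = best_parse(len(per_base), toy_score)
print(score, segments)
```

The loop over all segment starts makes this sketch quadratic in sequence length. Because every segment ends where the next begins and neighbouring segments must differ in label, the adjacency and non-overlap constraints hold by construction.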
Pages: 607-613
Page count: 7