DETERMINATION OF EUKARYOTIC PROTEIN CODING REGIONS USING NEURAL NETWORKS AND INFORMATION-THEORY

Cited by: 77
Authors
FARBER, R [1]
LAPEDES, A [1]
SIROTKIN, K [1]
Affiliations
[1] NIH,NATL CTR BIOTECHNOL INFORMAT,BETHESDA,MD 20892
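Keywords
CODING REGION; NEURAL NETS; INFORMATION THEORY; EXON; INTRON
DOI
10.1016/0022-2836(92)90961-I
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry]
Abstract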
Our previous work applied neural network techniques to the problem of discriminating open reading frame (ORF) sequences taken from introns versus exons. The method counted the codon frequencies in an ORF of a specified length, and then used this codon frequency representation of DNA fragments to train a neural net (essentially a Perceptron with a sigmoidal, or "soft step function", output) to perform this discrimination. After training, the network was applied to a disjoint "predict" set of data to assess accuracy. The resulting accuracy in our previous work was 98.4%, exceeding accuracies reported in the literature at that time for other algorithms. Here, we report even higher accuracies stemming from calculations of mutual information (a correlation measure) of spatially separated codons in exons, and in introns. Significant mutual information exists in exons, but not in introns, between adjacent codons. This suggests that dicodon frequencies of adjacent codons are important for intron/exon discrimination. We report that accuracies obtained using a neural net trained on the frequency of dicodons are significantly higher at smaller fragment lengths than even our original results using codon frequencies, which were already higher than simple statistical methods that also used codon frequencies. We also report accuracies obtained from including codon and dicodon statistics in all six reading frames, i.e. the three frames on the original and complement strand. Inclusion of six-frame statistics increases the accuracy still further. We also compare these neural net results to a Bayesian statistical prediction method that assumes independent codon frequencies in each position. The performance of the Bayesian scheme is poorer than any of the neural-net-based schemes; however, many methods reported in the literature either explicitly or implicitly use this method.
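The mutual information between codons at a fixed separation can be illustrated with a short sketch. This is not the authors' code: the function names, the pooling of pair counts over all fragments, and the plug-in (maximum-likelihood) estimator are assumptions made for illustration only.

```python
import math
from collections import Counter


def codons(seq):
    """Split a sequence into non-overlapping in-frame codons,
    dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_mutual_information(fragments, separation=1):
    """Plug-in estimate of the mutual information (in bits) between
    codons that lie `separation` codon positions apart, with pair
    counts pooled over all fragments (an assumption of this sketch)."""
    pair_counts = Counter()
    for seq in fragments:
        cs = codons(seq)
        for first, second in zip(cs, cs[separation:]):
            pair_counts[(first, second)] += 1

    total = sum(pair_counts.values())
    if total == 0:
        return 0.0

    # Marginal counts of the first and second codon of each pair.
    left, right = Counter(), Counter()
    for (first, second), n in pair_counts.items():
        left[first] += n
        right[second] += n

    mi = 0.0
    for (first, second), n in pair_counts.items():
        # p(a,b) / (p(a) p(b)) simplifies to n * total / (left * right).
        ratio = n * total / (left[first] * right[second])
        mi += (n / total) * math.log2(ratio)
    return mi
```

With `separation=1` the estimate covers adjacent codons, the case the abstract singles out. Plug-in estimates of this kind are biased upward on small samples, so values computed for exon and intron sets are best compared at similar sample sizes.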
Specifically, Bayesian prediction schemes based on codon frequencies achieve 90.9% accuracy on 90 codon ORFs, while our best neural net scheme reaches 99.4% accuracy on 60 codon ORFs. "Accuracy" is defined as the average of the exon and intron sensitivities. Achievement of sufficiently high accuracies on short fragment lengths can be useful in providing a computational means of finding coding regions in unannotated DNA sequences such as those arising from the mega-base sequencing efforts of the Human Genome Project. We caution that the high accuracies reported here do not represent a complete solution to the problem of identifying exons in "raw" base sequences. The accuracies are considerably lower for exons of small length, although still higher than accuracies reported in the literature for other methods. Short exon lengths are not uncommon. A complete solution to the problem may need a combination of methods, including accurate computational methods of identifying splice sites. © 1992.
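A minimal sketch of a Bayesian classifier that treats codons as independent, together with the accuracy measure defined above (the average of the exon and intron sensitivities), might look as follows. It is illustrative only and not the authors' implementation: the function names, the add-one pseudocount, and the equal class priors are all assumptions.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into non-overlapping in-frame codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_log_probs(fragments, pseudocount=1.0):
    """Log codon frequencies estimated from training fragments.
    The pseudocount is an assumption of this sketch; it keeps
    codons unseen in training from getting zero probability."""
    counts = Counter(c for seq in fragments for c in codons(seq))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: math.log((counts[c] + pseudocount) / total) for c in ALL_CODONS}


def classify(seq, exon_lp, intron_lp):
    """Label a fragment by the sign of its log-likelihood ratio,
    assuming independent codons and equal priors. Codons with
    ambiguous bases (for example N) are skipped."""
    score = sum(exon_lp[c] - intron_lp[c] for c in codons(seq) if c in exon_lp)
    return "exon" if score > 0 else "intron"


def average_sensitivity(predicted, actual):
    """Mean of the exon and intron sensitivities. Assumes both
    classes occur at least once in `actual`."""
    sensitivities = []
    for label in ("exon", "intron"):
        hits = [p == label for p, a in zip(predicted, actual) if a == label]
        sensitivities.append(sum(hits) / len(hits))
    return sum(sensitivities) / 2
```

Averaging the two sensitivities keeps the score from being dominated by whichever class is more common in the test set, which matters when exon and intron fragments are not equally represented.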
Pages: 471-479
Number of pages: 9