ESTIMATION OF PROTEIN-CODING DENSITY IN A CORPUS OF DNA-SEQUENCE DATA

Cited by: 6
Authors
FICKETT, JW [1 ]
GUIGO, R [1 ]
Affiliations
[1] LOS ALAMOS NATL LAB, CTR HUMAN GENOME STUDIES, LOS ALAMOS, NM 87545
DOI
10.1093/nar/21.12.2837
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence-analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein-coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data, and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.
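The decomposition step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the coding statistic has already been computed for each window (the abstract does not specify the statistic), fits a two-component normal mixture with a plain EM iteration, and treats the higher-mean component as "coding". That labeling, the median-split initialization, and the function names `normal_pdf` and `fit_two_normals` are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, iterations=200):
    """Fit a two-component 1-D normal mixture by EM.

    Returns (w, (m_low, s_low), (m_high, s_high)), where w is the
    estimated mixing proportion of the higher-mean component.
    Assumption: the higher-mean component corresponds to coding windows.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Initialize the two means from the lower and upper halves of the data.
    m_low = sum(xs[:half]) / half
    m_high = sum(xs[half:]) / (n - half)
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    s_low = s_high = max(sd_all, 1e-6)
    w = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each window is "high".
        resp = []
        for x in xs:
            a = w * normal_pdf(x, m_high, s_high)
            b = (1.0 - w) * normal_pdf(x, m_low, s_low)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: re-estimate weight, means and standard deviations.
        r_high = max(sum(resp), 1e-12)
        r_low = max(n - r_high, 1e-12)
        w = r_high / n
        m_high = sum(r * x for r, x in zip(resp, xs)) / r_high
        m_low = sum((1.0 - r) * x for r, x in zip(resp, xs)) / r_low
        var_high = sum(r * (x - m_high) ** 2 for r, x in zip(resp, xs)) / r_high
        var_low = sum((1.0 - r) * (x - m_low) ** 2 for r, x in zip(resp, xs)) / r_low
        s_high = max(math.sqrt(var_high), 1e-6)
        s_low = max(math.sqrt(var_low), 1e-6)
    return w, (m_low, s_low), (m_high, s_high)
```

Under these assumptions, the returned weight `w` estimates the fraction of windows drawn from the coding distribution. How that fraction is converted into a coding density in base pairs depends on the window design, which the abstract does not describe.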
Pages: 2837-2844
Number of pages: 8