Estimating the entropy of DNA sequences

Cited by: 103
Authors
Schmitt, AO [1]
Herzel, H [1]
Affiliations
[1] HUMBOLDT UNIV BERLIN, INST THEORET BIOL, D-10115 BERLIN, GERMANY
DOI
10.1006/jtbi.1997.0493
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
The Shannon entropy is a standard measure for the order state of symbol sequences such as DNA sequences. In order to incorporate correlations between symbols, the entropy of n-mers (consecutive strands of n symbols) has to be determined. Here, a method is presented to estimate such higher-order entropies (block entropies) for DNA sequences when the actual number of observations is small compared with the number of possible outcomes. The n-mer probability distribution underlying the dynamical process is reconstructed using elementary statistical principles: the theorem of asymptotic equi-distribution and the Maximum Entropy Principle. Constraints are set to force the constructed distributions to adopt features that are characteristic of the real probability distribution. Of the many solutions compatible with these constraints, the one with the highest entropy is the most likely one according to the Maximum Entropy Principle. An algorithm performing this procedure is described. It is tested by applying it to various DNA model sequences whose exact entropies are known. Finally, results for a real DNA sequence, the complete genome of the Epstein-Barr virus, are presented and compared with those of other information carriers (texts, computer source code, music). DNA sequences appear to possess much more freedom in the combination of the symbols of their alphabet than written language or computer source code. © 1997 Academic Press Limited.
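As an illustration of the quantity the abstract refers to, the following minimal Python sketch computes the naive (plug-in) block entropy of overlapping n-mers directly from observed frequencies. It is not the maximum-entropy reconstruction described in the paper; the function name `block_entropy` and the choice of overlapping windows and base-2 logarithms are assumptions made for this example only.

```python
import math
from collections import Counter


def block_entropy(seq: str, n: int) -> float:
    """Naive plug-in Shannon entropy (in bits) of the overlapping n-mers in seq.

    Hypothetical helper for illustration; not the estimator from the paper.
    """
    if n < 1 or n > len(seq):
        raise ValueError("n must satisfy 1 <= n <= len(seq)")
    # Count every overlapping window of length n.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    # H_n = -sum_i p_i * log2(p_i), with p_i taken as relative frequencies.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    example = "ACGTACGT"  # toy sequence, not real data
    for n in (1, 2, 3):
        print(n, block_entropy(example, n))
```

This plug-in estimate can never exceed the base-2 logarithm of the number of observed windows, so it systematically underestimates the block entropy once the 4^n possible n-mers outnumber the observations. That undersampled regime is the one the abstract's method is aimed at.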
Pages: 369-377
Page count: 9