A NEW CHALLENGE FOR COMPRESSION ALGORITHMS - GENETIC SEQUENCES

被引:152
作者
GRUMBACH, S
TAHI, F
机构
[1] I.N.R.I.A., Rocquencourt, 78153 Le Chesnay
关键词
D O I
10.1016/0306-4573(94)90014-0
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Universal data compression algorithms fail to compress genetic sequences. It is due to the specificity of this particular kind of ''text.'' We analyze in some detail the properties of the sequences, which cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods, and to the best of our knowledge, leads to the highest compression of DNA. The results, although not satisfactory, give insight to the necessary correlation between compression and comprehension of genetic sequences.
引用
收藏
页码:875 / 886
页数:12
相关论文
共 18 条
  • [1] APOSTOLICO A, 1992, UNPUBG OPTIMAL PARAL
  • [2] PERIODICITIES IN CODING AND NONCODING REGIONS OF THE GENES
    ARQUES, DG
    MICHEL, CJ
    [J]. JOURNAL OF THEORETICAL BIOLOGY, 1990, 143 (03) : 307 - 318
  • [3] Bennett C.H., 1988, CHAPTER LOGICAL DEPT, P227
  • [4] STATISTICAL-ANALYSIS OF DEOXYRIBONUCLEIC-ACID SEQUENCE DATA - A REVIEW
    CURNOW, RN
    KIRKWOOD, TBL
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY, 1989, 152 : 199 - 220
  • [5] DATA-COMPRESSION WITH FINITE WINDOWS
    FIALA, ER
    GREENE, DH
    [J]. COMMUNICATIONS OF THE ACM, 1989, 32 (04) : 490 - 505
  • [6] GRUMBACH S, 1993, P IEEE S DATA COMPRE
  • [7] A METHOD FOR THE CONSTRUCTION OF MINIMUM-REDUNDANCY CODES
    HUFFMAN, DA
    [J]. PROCEEDINGS OF THE INSTITUTE OF RADIO ENGINEERS, 1952, 40 (09): : 1098 - 1101
  • [8] KAMEL N, 1991, P INT C VERY LARGE D
  • [9] KOLMOGOROV A. N., 1965, PROBL PEREDACHI INF, V1, P3
  • [10] Lelewer D. A., 1987, ACM COMPUT SURV, V19, P261