Protein is incompressible

Cited by: 80
Authors
Nevill-Manning, CG [1]
Witten, IH [1]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08855 USA
Source
DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS | 1999
DOI
10.1109/DCC.1999.755675
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown: that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models.
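The claim about little Markov dependency can be illustrated by comparing an order-0 entropy estimate with an order-1 conditional entropy estimate for the same sequence. The Python sketch below is not the authors' compressor or their blending scheme; it is only a plain empirical-frequency calculation, and the function names and the toy sequence are hypothetical. If the preceding residue carries little information about the next one, the two estimates should be close on a large corpus.

```python
import math
from collections import Counter, defaultdict


def order0_entropy(seq):
    """Empirical entropy in bits per symbol, ignoring all context."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def order1_conditional_entropy(seq):
    """Empirical entropy in bits per symbol, given the previous symbol."""
    # next_counts[prev][cur] = number of times `cur` follows `prev`
    next_counts = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        next_counts[prev][cur] += 1
    total_pairs = len(seq) - 1
    h = 0.0
    for nexts in next_counts.values():
        context_total = sum(nexts.values())
        for c in nexts.values():
            # joint probability of the pair times log of the conditional
            h -= (c / total_pairs) * math.log2(c / context_total)
    return h


if __name__ == "__main__":
    # Hypothetical toy sequence of amino acid letters, not real data.
    toy = "MKVLAAGIVGLLLAQWERTYIPASDFGHKLCVNM"
    print(order0_entropy(toy))
    print(order1_conditional_entropy(toy))
```

On a sequence this short the order-1 estimate is biased downward, because most contexts are seen only once or twice, so a meaningful comparison needs a large protein corpus and some held-out or adaptive coding scheme.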
Pages: 257-266
Number of pages: 4