Protein is incompressible

Cited by: 80
Authors
Nevill-Manning, CG [1]
Witten, IH [1]
Affiliations
[1] Rutgers State Univ, Piscataway, NJ 08855 USA
Source
DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS | 1999
Keywords
DOI
10.1109/DCC.1999.755675
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown: that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order-zero models.
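The blending idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function names, the `mismatch` similarity score, and the additive `prior` are all assumptions made for illustration, and a realistic version would more plausibly derive similarity from an amino-acid substitution matrix. The sketch predicts the next symbol by letting every previously seen length-k context vote, with each vote weighted by how similar that context is to the current one.

```python
from collections import Counter
import math


def order0_entropy(seq):
    """Empirical order-zero entropy of seq, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def context_similarity(a, b, mismatch=0.2):
    """Hypothetical similarity: product of per-position scores,
    1.0 for a matching symbol and `mismatch` otherwise."""
    score = 1.0
    for x, y in zip(a, b):
        score *= 1.0 if x == y else mismatch
    return score


def blended_distribution(history, k, alphabet, mismatch=0.2, prior=1.0):
    """Blend over every length-k context seen so far (k >= 1).

    Assumes every symbol in `history` is also in `alphabet`.
    """
    current = history[-k:]
    weights = {s: prior for s in alphabet}  # additive smoothing (assumption)
    for i in range(k, len(history)):
        ctx = history[i - k:i]  # context that preceded history[i]
        weights[history[i]] += context_similarity(ctx, current, mismatch)
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


if __name__ == "__main__":
    seq = "ACACAC"
    print(order0_entropy(seq))
    print(blended_distribution(seq, k=1, alphabet="AC"))
    # With mismatch=1.0 every context counts equally, so the prediction
    # depends only on symbol frequencies, i.e. an order-zero model.
    print(blended_distribution(seq, k=1, alphabet="AC", mismatch=1.0))
```

Setting `mismatch=1.0` makes all contexts equally similar, which is one way to see how a blended model degenerates to order zero when context carries little information about the next symbol.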
Pages: 257-266
Page count: 4