On-line and off-line heuristics for inferring hierarchies of repetitions in sequences

被引:27
作者
Nevill-Manning, CG [1 ]
Witten, IH
机构
[1] Rutgers State Univ, Dept Comp Sci, Piscataway, NJ 08854 USA
[2] Univ Waikato, Dept Comp Sci, Hamilton, New Zealand
关键词
dictionary compression; grammar inference; hierarchical compression;
D O I
10.1109/5.892710
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule. While such schemes usually operate on-line, making a replacement as soon as repetition is detected, off-line operation permits greater freedom in choosing the order of replacement. In this paper, we compare the on-line method with three off-line heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the grammar locally. Surprisingly, two of the off-line techniques, like the on-line method, run in time linear in the size of the input. We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the on-line technique, and, lagging by some distance, the longest-first technique.
引用
收藏
页码:1745 / 1755
页数:11
相关论文
共 16 条
[1]  
Abelson H., 1980, TURTLE GEOMETRY
[2]   Data compression using long common strings [J].
Bentley, J ;
McIlroy, D .
DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS, 1999, :287-295
[3]   SOURCE ENCODING USING SYNTACTIC INFORMATION SOURCE MODELS [J].
CAMERON, RD .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1988, 34 (04) :843-850
[4]  
Gusfield D, 1997, ALGORITHMS STRINGS T
[5]   Grammar-based codes: A new class of universal lossless source codes [J].
Kieffer, JC ;
Yang, EH .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2000, 46 (03) :737-754
[6]   Offline dictionary-based compression [J].
Larsson, NJ ;
Moffat, A .
DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS, 1999, :296-305
[7]   Whole program paths [J].
Larus, JR .
ACM SIGPLAN NOTICES, 1999, 34 (05) :259-269
[8]  
MARTIN AR, 1999, THESIS U TORONTO TOR
[9]  
MILLER VS, 1985, SER NATO ASI F, V12, P131
[10]   Phrase hierarchy inference and compression in bounded space [J].
Nevill-Manning, CG ;
Witten, IH .
DCC '98 - DATA COMPRESSION CONFERENCE, 1998, :179-188