Boosting textual compression in optimal linear time

被引:68
作者
Ferragina, P
Giancarlo, R
Manzini, G
Sciortino, M
机构
[1] Univ Pisa, Dipartimento Informat, I-56127 Pisa, Italy
[2] Univ Palermo, I-90133 Palermo, Italy
[3] Univ Piemonte Orientale, Alessandria, Italy
关键词
algorithms; design; theory; arithmetic coding; Burrows-Wheeler transform; empirical entropy; Huffman coding; Lempel-Ziv compressors; suffix tree; text compression;
D O I
10.1145/1082036.1082043
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the "best possible" contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique' displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows-Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost. Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some ;of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows-Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.
引用
收藏
页码:688 / 713
页数:26
相关论文
共 35 条
[1]   Generalization of the BWT transformation and inversion ranks [J].
Arnavut, Z .
DCC 2002: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2002, :447-447
[2]   Modifications of the Burrows and Wheeler data compression algorithm [J].
Balkenhol, B ;
Kurtz, S ;
Shtarkov, YM .
DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS, 1999, :188-197
[3]   A LOCALLY ADAPTIVE DATA-COMPRESSION SCHEME [J].
BENTLEY, JL ;
SLEATOR, DD ;
TARJAN, RE ;
WEI, VK .
COMMUNICATIONS OF THE ACM, 1986, 29 (04) :320-330
[4]   BOUNDS ON THE REDUNDANCY OF HUFFMAN CODES [J].
CAPOCELLI, RM ;
GIANCARLO, R ;
TANEJA, IJ .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1986, 32 (06) :854-857
[5]   Unbounded length contexts for PPM [J].
Cleary, JG ;
Teahan, WJ .
COMPUTER JOURNAL, 1997, 40 (2-3) :67-75
[6]   DATA-COMPRESSION USING DYNAMIC MARKOV MODELING [J].
CORMACK, GV ;
HORSPOOL, RNS .
COMPUTER JOURNAL, 1987, 30 (06) :541-550
[7]  
COVER TM, 1990, ELEMENTS INFORMATION
[8]   A note on the Burrows-Wheeler transformation [J].
Crochemore, M ;
Désarménien, J ;
Perrin, D .
THEORETICAL COMPUTER SCIENCE, 2005, 332 (1-3) :567-572
[10]   Universal lossless source coding with the Burrows Wheeler Transform [J].
Effros, M ;
Visweswariah, K ;
Kulkarni, SR ;
Verdú, S .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2002, 48 (05) :1061-1081