When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications

被引:30
作者
Foschini, Luca [1 ]
Grossi, Roberto [2 ]
Gupta, Ankur [3 ]
Vitter, Jeffrey Scott [4 ]
机构
[1] Scuola Super Sant Anna, I-56127 Pisa, Italy
[2] Univ Pisa, Dipartimento Informat, I-56127 Pisa, Italy
[3] Duke Univ, Dept Comp Sci, Ctr Geometr & Biol Comp, Durham, NC 27708 USA
[4] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
关键词
Algorithms; Design; Experimentation; Theory; Entropy; text indexing; Burrows-Wheeler Transform; suffix array;
D O I
10.1145/1198513.1198521
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previouswork and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size-without requiring a separate instance of the text. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows-Wheeler transform), achieving a compression ratio comparable to that of bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.
引用
收藏
页数:29
相关论文
共 42 条
[1]  
Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
[2]  
[Anonymous], 2003, P 14 ANN ACM SIAM S
[3]  
[Anonymous], 1997, ALGORITHMS STRINGS T, DOI DOI 10.1017/CBO9780511574931
[4]  
ARIMURA H., 2001, CPM
[5]   The Level Ancestor Problem simplified [J].
Bender, MA ;
Farach-Colton, M .
THEORETICAL COMPUTER SCIENCE, 2004, 321 (01) :5-12
[6]   A LOCALLY ADAPTIVE DATA-COMPRESSION SCHEME [J].
BENTLEY, JL ;
SLEATOR, DD ;
TARJAN, RE ;
WEI, VK .
COMMUNICATIONS OF THE ACM, 1986, 29 (04) :320-330
[7]   Membership in constant time and almost-minimum space [J].
Brodnik, A ;
Munro, JI .
SIAM JOURNAL ON COMPUTING, 1999, 28 (05) :1627-1640
[8]   Fractional Cascading: I. A Data Structuring Technique [J].
Chazelle, Bernard ;
Guibas, Leonidas J. .
ALGORITHMICA, 1986, 1 (1-4) :133-162
[10]   Burrows-Wheeler compression with variable length integer codes [J].
Fenwick, P .
SOFTWARE-PRACTICE & EXPERIENCE, 2002, 32 (13) :1307-1316