When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications

被引:30
作者
Foschini, Luca [1 ]
Grossi, Roberto [2 ]
Gupta, Ankur [3 ]
Vitter, Jeffrey Scott [4 ]
机构
[1] Scuola Super Sant Anna, I-56127 Pisa, Italy
[2] Univ Pisa, Dipartimento Informat, I-56127 Pisa, Italy
[3] Duke Univ, Dept Comp Sci, Ctr Geometr & Biol Comp, Durham, NC 27708 USA
[4] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
关键词
Algorithms; Design; Experimentation; Theory; Entropy; text indexing; Burrows-Wheeler Transform; suffix array;
D O I
10.1145/1198513.1198521
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previouswork and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size-without requiring a separate instance of the text. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows-Wheeler transform), achieving a compression ratio comparable to that of bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.
引用
收藏
页数:29
相关论文
共 42 条
[11]  
FENWICK P., 1996, 137 TR U AUCKL
[12]   Indexing compressed text [J].
Ferragina, P ;
Manzini, G .
JOURNAL OF THE ACM, 2005, 52 (04) :552-581
[13]   Boosting textual compression in optimal linear time [J].
Ferragina, P ;
Giancarlo, R ;
Manzini, G ;
Sciortino, M .
JOURNAL OF THE ACM, 2005, 52 (04) :688-713
[14]  
Ferragina P, 2001, SIAM PROC S, P269
[15]  
FOSCHINI L., 2004, P IEEE DAT COMPR C S
[16]  
Gonnet G.H., 1992, INFORMATION RETRIEVA, P66
[17]   Compressed suffix arrays and suffix trees with applications to text indexing and string matching [J].
Grossi, R ;
Vitter, JS .
SIAM JOURNAL ON COMPUTING, 2005, 35 (02) :378-407
[18]  
Grossi R., 2004, P 15 ANN ACM SIAM S
[19]  
Hon Wing-Kai, 2004, ALENEXANALC
[20]  
Hon WK, 2003, ANN IEEE SYMP FOUND, P251