Indexing compressed text

被引:398
作者
Ferragina, P
Manzini, G
机构
[1] Univ Pisa, Dipartimento Informat, I-56127 Pisa, Italy
[2] Univ Piemonte Orientale, Alessandria, Italy
关键词
algorithms; design; theory; Burrows-Wheeler transform; full-text indexing; indexing data structure; Lempel-Ziv compressor; pattern searching; suffix tree; suffix array; text compression;
D O I
10.1145/1082036.1082039
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P [ 1, p] within a text T [1, n] in 0 (p + occ log(1+epsilon) n) time for any chosen epsilon 0 < E < 1. This data structure uses at most 5nH(k)(T) + o(n) bits of storage, where H-k(T) is the kth order empirical entropy of T. The space usage is 6(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p + occ) query time using O(nH(k)(T) log(epsilon) n) + o(n) bits of storage for any chosen E, 0 < E < 1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.
引用
收藏
页码:552 / 581
页数:30
相关论文
共 45 条
[1]   New data structures for orthogonal range searching [J].
Alstrup, S ;
Brodal, GS ;
Rauhe, T .
41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, :198-207
[2]  
ANDERSSON A, 1996, P 5 SCAND WORKSH ALG, V1097, P185
[3]   A LOCALLY ADAPTIVE DATA-COMPRESSION SCHEME [J].
BENTLEY, JL ;
SLEATOR, DD ;
TARJAN, RE ;
WEI, VK .
COMMUNICATIONS OF THE ACM, 1986, 29 (04) :320-330
[4]   Membership in constant time and almost-minimum space [J].
Brodnik, A ;
Munro, JI .
SIAM JOURNAL ON COMPUTING, 1999, 28 (05) :1627-1640
[5]  
Chan HL, 2004, LECT NOTES COMPUT SC, V3109, P445
[6]  
Clark DR, 1996, PROCEEDINGS OF THE SEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P383
[7]   A time and space efficient data structure for string searching on large texts [J].
Colussi, L ;
DeCol, A .
INFORMATION PROCESSING LETTERS, 1996, 58 (05) :217-222
[8]   String matching in Lempel-Ziv compressed strings [J].
Farach, M ;
Thorup, M .
ALGORITHMICA, 1998, 20 (04) :388-404
[9]   The Burrows-Wheeler transform for block sorting text compression: Principles and improvements [J].
Fenwick, PM .
COMPUTER JOURNAL, 1996, 39 (09) :731-740
[10]   Boosting textual compression in optimal linear time [J].
Ferragina, P ;
Giancarlo, R ;
Manzini, G ;
Sciortino, M .
JOURNAL OF THE ACM, 2005, 52 (04) :688-713