String matching in Lempel-Ziv compressed strings

被引:52
作者
Farach, M [1 ]
Thorup, M
机构
[1] Rutgers State Univ, Dept Comp Sci, Piscataway, NJ 08855 USA
[2] Univ Copenhagen, Dept Comp Sci, DK-2100 Copenhagen, Denmark
关键词
string matching; compression; algorithms;
D O I
10.1007/PL00009202
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
String matching and compression are two widely studied arena of computer science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel-Ziv (LZ77) algorithm. intuitively, once a string has been compressed-and therefore its repetitive nature has been elucidated--one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text, without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per bit entropy, all these schemes compress optimally in the limit. However, for strings with o(1) per bit entropy, while it was recently shown that the LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if \Z\ = o(\T\), i.e., if the compression ratio \T\/\Z\ is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(N log(2) U/N+P) where N = \Z\, U = \T\, and P = \P\. Compare with the naive "decompresion" algorithm, which takes time Theta(U + P) to decide of P occurs in T. Writing U + P as N . U/N + P, we see that we have improved the complexity, replacing the compression factor U/N by a factor log(2) U/N. Our algorithm is competitive in the sense that O(N log(2) U/N + P) = O(U + P), and opportunistic in the sense that O(N log(2) U/N + P) = o(U + P) if N = o(U) and P = o(U).
引用
收藏
页码:388 / 404
页数:17
相关论文
共 16 条
[1]  
Aho A.V., 1974, The Design and Analysis of Computer Algorithms
[2]  
AMIR A, 1994, PROCEEDINGS OF THE FIFTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P705
[3]  
Amir A., 1992, P 2 IEEE DAT COMPR C, P279
[4]  
AMIR A, 1994, P 21 INT C AUT LANG, P215
[5]  
[Anonymous], 1988, DATA COMPRESSION MET
[6]  
Corman T., 1990, INTRO ALGORITHMS
[7]  
Farach M., 1995, P 7 ANN ACM S PAR AL, P244
[8]  
GU M, 1994, PROCEEDINGS OF THE FIFTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P697
[9]   EFFICIENT RANDOMIZED PATTERN-MATCHING ALGORITHMS [J].
KARP, RM ;
RABIN, MO .
IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1987, 31 (02) :249-260
[10]  
KOSARAJU SR, 1995, P 10 ANN C FDN SOFTW