Faster approximate string matching

被引:101
作者
BaezaYates, R [1 ]
Navarro, G [1 ]
机构
[1] Univ Chile, Dept Comp Sci, Santiago, Chile
关键词
text searching allowing errors; bit-parallelism;
D O I
10.1007/PL00009253
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Omega(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)), where m is the pattern length and k < m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O (log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost. We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with less errors using the simple automaton, to obtain an average cost close to O(root mk/w n). Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near O(root mk/(sigma w) n) average complexity (sigma is the alphabet size). We perform a complete analysis of all the techniques and show how to combine them in an optimal form, also obtaining new tighter bounds for the probability of an approximate occurrence in random text. Finally, we show experimental results comparing our algorithms against previous work. These experiments show that our algorithms are among the fastest for typical text searching, being the fastest in some cases. Although we aim mainly at text searching, we believe that our ideas can be successfully applied to other areas such as computational biology.
引用
收藏
页码:127 / 158
页数:32
相关论文
共 36 条
[1]   EFFICIENT STRING MATCHING - AID TO BIBLIOGRAPHIC SEARCH [J].
AHO, AV ;
CORASICK, MJ .
COMMUNICATIONS OF THE ACM, 1975, 18 (06) :333-340
[2]  
[Anonymous], J MOL BIOL
[3]  
Baeza-Yates R., 1996, Proceedings of the Third South American Workshop on String Processing. WSP 1996, P47
[4]  
Baeza-Yates RA, 1992, P IFIP 12 C MADR, V1, P465
[5]  
BaezaYates R, 1997, LECT NOTES COMPUT SC, V1272, P174
[6]   A NEW APPROACH TO TEXT SEARCHING [J].
BAEZAYATES, R ;
GONNET, GH .
COMMUNICATIONS OF THE ACM, 1992, 35 (10) :74-82
[7]  
BAEZAYATES R, 1996, LNCS, V1075, P1
[8]   Fast and practical approximate string matching [J].
BaezaYates, RA ;
Perleberg, CH .
INFORMATION PROCESSING LETTERS, 1996, 59 (01) :21-27
[9]  
CHANG W, 1994, LECT NOTES COMPUT SC, V807, P259
[10]   SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS [J].
CHANG, WI ;
LAWLER, EL .
ALGORITHMICA, 1994, 12 (4-5) :327-344