A fast bit-vector algorithm for approximate string matching based on dynamic programming

被引:282
作者
Myers, G [1 ]
机构
[1] Univ Arizona, Tucson, AZ USA
关键词
approximate string search; bit-parallelism; sequence comparison;
D O I
10.1145/316542.316550
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nmk/w) or O(nm lag sigma/w) time where w is the word size of the machine (e.g., 32 or 63, in practice), and sigma is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and small m. Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu ct al. [1996]. This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic programming (d.p.) matrix w entries at a time using the basic algorithm as a subroutine, is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when kim is sufficiently small.
引用
收藏
页码:395 / 415
页数:21
相关论文
共 22 条
[1]  
[Anonymous], LECT NOTES COMPUTER
[2]  
BAEZAHYATES RA, 1999, UNPUB ANAL ALGORITHM
[3]   A NEW APPROACH TO TEXT SEARCHING [J].
BAEZAYATES, R ;
GONNET, GH .
COMMUNICATIONS OF THE ACM, 1992, 35 (10) :74-82
[4]  
BAEZAYATES R, 1996, LNCS, V1075, P1
[5]   SUBLINEAR APPROXIMATE STRING-MATCHING AND BIOLOGICAL APPLICATIONS [J].
CHANG, WI ;
LAWLER, EL .
ALGORITHMICA, 1994, 12 (4-5) :327-344
[6]  
Chao K M, 1994, J Comput Biol, V1, P271, DOI 10.1089/cmb.1994.1.271
[7]  
CHAO KM, 1992, COMPUT APPL BIOSCI, V8, P481
[8]  
Cobbs AL, 1995, LECT NOTES COMPUT SC, V937, P41
[9]   AN IMPROVED ALGORITHM FOR APPROXIMATE STRING MATCHING [J].
GALIL, Z ;
PARK, K .
SIAM JOURNAL ON COMPUTING, 1990, 19 (06) :989-999
[10]  
Lampe J, 1992, LNCS, V644, P172