Compact recognizers of episode sequences

被引:12
作者
Apostolico, A
Atallah, MJ
机构
[1] Univ Padua, Dipartimento Elettron & Informat, I-35131 Padua, Italy
[2] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[3] Purdue Univ, CERIAS Lab, W Lafayette, IN 47907 USA
基金
美国国家科学基金会; 英国工程与自然科学研究理事会;
关键词
algorithms; pattern matching; subsequence and episode searching; DAWG; suffix automaton; compact subsequence automaton; skip-edge DAWG; forward failure function; skip-link;
D O I
10.1006/inco.2002.3143
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Given two strings X = a(1) ... a(n) and P = b(1) ... b(m) over an alphabet Sigma, the problem of testing whether P occurs as a subsequence of X is trivially solved in linear time. It is also known that a simple O(n log \Sigma\) time preprocessing of X makes it easy to decide subsequently, for any P and in at most \P\log\Sigma\ character comparisons, whether P is a subsequence of X. These problems become more complicated if one asks instead whether P occurs as a subsequence of some substring Y off of bounded length. This paper presents an automaton built on the textstring X and capable of identifying all distinct minimal substrings Y of X having P as a subsequence. By a substring Y being minimal with respect to P, it is meant that P is not a subsequence of any proper substring of Y. For every minimal substring Y. the automaton recognizes the occurrence of P having the lexicographically smallest sequence of symbol positions in Y. It is not difficult to realize such an automaton in time and space O(n(2)) for a text of ncharacters. One result of this paper consists of bringing those bounds down to linear or O (n log n), respectively, depending on whether the alphabet is bounded or of arbitrary size. thereby matching the corresponding complexities of automata constructions for offline exact string searching. Having built the automaton, the search for all lexicographically earliest Occurrences of P in X is carried out in time O(Sigma(i=1)(m) rocc(i) . i) or O(n + Sigma(i=1)(m)rocc(i) . i . log n) depending on whether the alphabet is fixed or arbitrary, where rocci is the number of distinct minimal substrings of X having b(1),...b(i) as a subsequence (note that each such substring, may occur many times in X but is counted only once in the bound). All log factors appearing in the above bounds can be further reduced to log log by resorting to known integer-handling data structures. (C) 2002 Elsevier, Science (USA).
引用
收藏
页码:180 / 192
页数:13
相关论文
共 12 条
[1]  
[Anonymous], LECT NOTES COMPUTER
[2]   THE LONGEST COMMON SUBSEQUENCE PROBLEM REVISITED [J].
APOSTOLICO, A ;
GUERRA, C .
ALGORITHMICA, 1987, 2 (03) :315-336
[3]  
APOSTOLICO A, 1967, PATTERN MATCHING ALG
[4]  
BAEZAYATES R, 1991, THEORET COMPUT SCI, V78, P373
[5]   THE SMALLEST AUTOMATION RECOGNIZING THE SUBWORDS OF A TEXT [J].
BLUMER, A ;
BLUMER, J ;
HAUSSLER, D ;
EHRENFEUCHT, A ;
CHEN, MT ;
SEIFERAS, J .
THEORETICAL COMPUTER SCIENCE, 1985, 40 (01) :31-55
[6]  
Cobbs AL, 1995, LECT NOTES COMPUT SC, V937, P41
[7]  
Crochemore M., 1994, TEXT ALGORITHMS
[8]  
Das G, 1997, LECT NOTES COMPUT SC, V1264, P12
[9]  
JOHNSON DB, 1982, MATH SYST THEORY, V15, P295
[10]  
Kumar Sandeep., 1994, P 17 NATL COMPUTER S, P11