Space efficient suffix trees

被引:77
作者
Munro, JL [1 ]
Raman, V
Rao, SS
机构
[1] Univ Waterloo, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
[2] Inst Math Sci, Chennai 600113, India
来源
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC | 2001年 / 39卷 / 02期
关键词
D O I
10.1006/jagm.2000.1151
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
We give first the representation of a suffix tree that uses n lg n + O(n) bits of space and supports searching for a pattern string in the given text (from a fixed size alphabet) in O(m) time, where n is the size of the text and m is the length of the pattern. The structure is quite simple and answers a question raised by Muthukrishnan (in Proceedings of the FST and TCS Preconference Workshop on Randomization, 1997, pp. 23-27). Previous compact representations of suffix trees had either a higher lower order term in space and had some expectation assumption or required more time for searching. When the size of the alphabet k is not viewed as a constant, this structure can be modified to use the same space but take O(rn Ig k) time for string searching or to use an additional O(n Ig k) hits but take the same O(m) time for searching. We then give several index structures for binary texts, with less space including . a structure that uses a suffix array (n inverted right perpendicular lg n inverted left perpendicular bits) and an additional o(n) bits, . an indexing structure that takes n/2 lg n + O(n) bits of space, and . an o(n lg n) bit structure which answers in O(m) time, the decision question of whether a given pattern of length nr occurs in the text. Each of these structures uses a different technique, either in the storage scheme or in the search algorithm, in reducing the space requirement. The first one uses a suffix array, a sparse suffix tree, and a table structure. Finding all the occurrences of a pattern using this structure takes O(m + s) time, where s is the number of occurrences of the pattern in the test. The second structure constructs a sparse suffix tree for all the suffixes that start with the bit that occurs more times in the given binary text. The last structure uses an iterative algorithm to search for the pattern. This structure is the first o(n lg n) bit index to support the decision version of indexing queries in time linens in the length of the pattern. But this does not support the general indexing queries where we want to find the position of the occurrence of the pattern. Our main contribution is the development of techniques to use the succinct tree representation through balanced parentheses for suffix trees. (C) 2001 Academic Press.
引用
收藏
页码:205 / 222
页数:18
相关论文
共 27 条
[1]   STRUCTURAL-PROPERTIES OF THE STRING STATISTICS PROBLEM [J].
APOSTOLICO, A ;
PREPARATA, FP .
JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1985, 31 (03) :394-411
[2]  
Benoit D, 1999, LECT NOTES COMPUT SC, V1663, P169
[3]  
BENOIT D, 1998, THESIS U WATERLOO CA
[4]   COMPLETE INVERTED FILES FOR EFFICIENT TEXT RETRIEVAL AND ANALYSIS [J].
BLUMER, A ;
BLUMER, J ;
HAUSSLER, D ;
MCCONNELL, R ;
EHRENFEUCHT, A .
JOURNAL OF THE ACM, 1987, 34 (03) :578-595
[5]   ANALYSIS AND PERFORMANCE OF INVERTED DATA BASE STRUCTURES [J].
CARDENAS, AF .
COMMUNICATIONS OF THE ACM, 1975, 18 (05) :253-263
[6]  
Clark D.R., 1996, Ph.D. thesis
[7]  
Clark DR, 1996, PROCEEDINGS OF THE SEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P383
[8]   SEQUENCE LANDSCAPES [J].
CLIFT, B ;
HAUSSLER, D ;
MCCONNELL, R ;
SCHNEIDER, TD ;
STORMO, GD .
NUCLEIC ACIDS RESEARCH, 1986, 14 (01) :141-158
[9]   A time and space efficient data structure for string searching on large texts [J].
Colussi, L ;
DeCol, A .
INFORMATION PROCESSING LETTERS, 1996, 58 (05) :217-222
[10]  
FRASER CW, 1984, P SIGPLAN 84 S COMP, P117