Burst tries: A fast, efficient data structure for string keys

Cited by: 77
Authors
Heinz, S [1]
Zobel, J [1]
Williams, HE [1]
Affiliation
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic 3001, Australia
Keywords
algorithms; binary trees; splay trees; string data structures; text databases; tries; vocabulary accumulation
DOI
10.1145/506309.506312
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it uses about the same memory as a binary search tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Pages: 192-223
Page count: 32