Adding compression to block addressing inverted indexes

Cited by: 61
Authors
Navarro, G [1]
De Moura, ES
Neubert, M
Ziviani, N
Baeza-Yates, R
Affiliations
[1] Univ Chile, Dept Comp Sci, Santiago, Chile
[2] Univ Fed Minas Gerais, Dept Comp Sci, Belo Horizonte, MG, Brazil
Source
INFORMATION RETRIEVAL | 2000 / Vol. 3 / Issue 01
Keywords
text compression; inverted files; block addressing; text databases
DOI
10.1023/A:1009934302807
CLC classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains a significant reduction of the index's original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions and pays for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against the three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches.
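The block addressing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it works on uncompressed text, measures blocks in words rather than bytes, and the names `build_block_index` and `search` are hypothetical, chosen only for this illustration. The index stores block numbers instead of exact positions, so a query has to scan the candidate blocks sequentially to recover the exact occurrences.

```python
from collections import defaultdict


def build_block_index(words, block_size):
    """Map each word to the ascending list of block numbers that contain it.

    Assumption for this sketch: a block is a run of `block_size` words.
    """
    index = defaultdict(list)
    for pos, word in enumerate(words):
        block = pos // block_size
        blocks = index[word]
        # Store each block number once per word; exact positions are dropped.
        if not blocks or blocks[-1] != block:
            blocks.append(block)
    return dict(index)


def search(words, index, word, block_size):
    """Return the exact word positions of `word`, scanning only candidate blocks."""
    hits = []
    for block in index.get(word, []):
        start = block * block_size
        end = min(start + block_size, len(words))
        # Sequential scan inside the block: the price paid for the smaller index.
        for pos in range(start, end):
            if words[pos] == word:
                hits.append(pos)
    return hits


words = "to be or not to be that is the question".split()
index = build_block_index(words, block_size=4)
print(index["to"])                           # [0, 1]
print(search(words, index, "be", 4))         # [1, 5]
print(search(words, index, "question", 4))   # [9]
```

Larger blocks shrink the lists (fewer distinct block numbers per word) but lengthen each scan, which is the trade-off against block size that the abstract says the paper studies.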
Pages: 49-77
Page count: 29
Related papers
35 in total
[1]  
[Anonymous], INFORM RETRIEVAL
[2]  
[Anonymous], 1949, Human behaviour and the principle of least-effort
[3]  
ARAUJO MD, 1997, P 4 S AM WORKSH STRI, V8, P2
[4]   A model and a visual query language for structured text [J].
Baeza-Yates, R ;
Navarro, G ;
Vegas, J ;
de la Fuente, P .
STRING PROCESSING AND INFORMATION RETRIEVAL - PROCEEDINGS: A SOUTH AMERICAN SYMPOSIUM, 1998, :7-13
[5]  
BaezaYates R, 2000, J AM SOC INFORM SCI, V51, P69, DOI 10.1002/(SICI)1097-4571(2000)51:1<69::AID-ASI10>3.0.CO;2-C
[7]   A NEW APPROACH TO TEXT SEARCHING [J].
BAEZAYATES, R ;
GONNET, GH .
COMMUNICATIONS OF THE ACM, 1992, 35 (10) :74-82
[8]  
BAEZAYATES R, 2000, COMMUNICATION
[9]  
BAEZAYATES R, 1996, ACM SIGMOD RECORD, V25, P67
[10]  
BAEZAYATES R, 1990, P SWAT 90, P332