Probabilistic Static Pruning of Inverted Files

被引:22
作者
Blanco, Roi [1 ]
Barreiro, Alvaro [1 ]
机构
[1] Univ A Coruna, Dept Comp Sci, IRLab, La Coruna, Spain
关键词
Algorithms; Experimentation; Performance; Pruning; inverted files; probabilistic models; compression; efficiency; MODELS; RETRIEVAL;
D O I
10.1145/1658377.1658378
中图分类号
TP [自动化技术、计算机技术];
学科分类号
080201 [机械制造及其自动化];
摘要
Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval performance, according to some criterion. Generally, pruning criteria are derived from term weighting functions, which assign weights to terms according to their contribution to a document's contents. Usually, document-term occurrences that are assigned a low weight are ruled out from the index. The main assumption is that those entries contribute little to the document content. We present a novel pruning technique that is based on a probabilistic model of IR. We employ the Probability Ranking Principle as a decision criterion over which posting list entries are to be pruned. The proposed approach requires the estimation of three probabilities, combining them in such a way that we gather all the necessary information to apply the aforementioned criterion. We evaluate our proposed pruning technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is proposing a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.
引用
收藏
页数:33
相关论文
共 46 条
[1]
Probabilistic models of information retrieval based on measuring the divergence from randomness [J].
Amati, G ;
Van Rijsbergen, CJ .
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2002, 20 (04) :357-389
[2]
Amati G, 2006, LECT NOTES COMPUT SC, V3936, P13
[3]
Anh V. N., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P372, DOI 10.1145/1148170.1148235
[4]
[Anonymous], P SIGIR
[5]
[Anonymous], 1998, Computer networks and ISDN systems, DOI [10.1016/S0169-7552(98)00110-X, DOI 10.1016/S0169-7552(98)00110-X]
[6]
Baeza-Yates R, 1999, MODERN INFORM RETRIE, V463
[7]
Baeza-Yates R., 2007, P 30 ANN INT ACM SIG, P183, DOI DOI 10.1145/1277741.1277775
[8]
Ben HE., 2003, Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM '03, P10
[9]
BLANCO R, 2007, P 30 ANN INT ACM SIG, P777
[10]
Blanco R, 2007, LECT NOTES COMPUT SC, V4425, P64