Probabilistic models of information retrieval based on measuring the divergence from randomness

Cited by: 420
Authors
Amati, G
Van Rijsbergen, CJ
Affiliations
[1] Fdn Ugo Bordoni, I-00142 Rome, Italy
[2] Univ Glasgow, Dept Comp Sci, Glasgow G12 8QQ, Lanark, Scotland
Keywords
algorithms; experimentation; theory; aftereffect model; BM25; binomial law; Bose-Einstein statistics; document length normalization; eliteness; idf; information retrieval; Laplace; Poisson; probabilistic models; randomness; succession law; term frequency normalization; term weighting
DOI
10.1145/582415.582416
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
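The abstract describes the weighting scheme only in words, so the following is a minimal sketch of one possible instance, not the authors' reference implementation. It assumes the commonly cited forms of three components named in the keywords: a geometric approximation to Bose-Einstein statistics as the randomness model, the Laplace law of succession as the first normalization (information gain), and a logarithmic document-length adjustment as the second normalization. All function and parameter names (`geometric_inf1`, `laplace_gain`, `normalise_tf`, `dfr_weight`, `c`) are hypothetical and chosen for illustration.

```python
import math


def geometric_inf1(tfn: float, term_coll_freq: int, num_docs: int) -> float:
    """Informative content -log2 P(tfn) under a geometric approximation
    to Bose-Einstein statistics (assumed form, with mean lam = F / N)."""
    lam = term_coll_freq / num_docs
    return -math.log2(1.0 / (1.0 + lam)) - tfn * math.log2(lam / (1.0 + lam))


def laplace_gain(tfn: float) -> float:
    """First normalization (assumed Laplace succession form): the gain
    factor shrinks as the term is seen more often in the document."""
    return 1.0 / (tfn + 1.0)


def normalise_tf(tf: float, doc_len: float, avg_doc_len: float, c: float = 1.0) -> float:
    """Second normalization (assumed logarithmic form): rescale the raw
    term frequency by document length relative to the collection average."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def dfr_weight(tf: float, doc_len: float, avg_doc_len: float,
               term_coll_freq: int, num_docs: int, c: float = 1.0) -> float:
    """Term weight = information gain * divergence from randomness."""
    if tf <= 0:
        return 0.0  # a term absent from the document contributes nothing
    tfn = normalise_tf(tf, doc_len, avg_doc_len, c)
    return laplace_gain(tfn) * geometric_inf1(tfn, term_coll_freq, num_docs)


if __name__ == "__main__":
    # Toy statistics, invented purely for illustration.
    rare = dfr_weight(tf=3, doc_len=200, avg_doc_len=200, term_coll_freq=10, num_docs=1000)
    common = dfr_weight(tf=3, doc_len=200, avg_doc_len=200, term_coll_freq=500, num_docs=1000)
    print(f"rare term weight:   {rare:.3f}")
    print(f"common term weight: {common:.3f}")
```

With equal within-document frequency, the term that is rare in the collection diverges more from the random process and so receives the larger weight, which is the idf-like behaviour the framework is meant to reproduce without fitted parameters.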
Pages: 357-389
Number of pages: 33