A probabilistic justification for using tf × idf term weighting in information retrieval

Cited by: 96
Authors
Hiemstra D. [1]
Affiliations
[1] Centre for Telematics and Information Technology, University of Twente
Keywords
Information retrieval theory; Statistical information retrieval; Statistical natural language processing;
DOI
10.1007/s007999900025
Abstract
This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf×idf term weighting. The paper shows that the new probabilistic interpretation of tf×idf term weighting might lead to better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm. © 2000 Springer-Verlag.
Pages: 131-139
Page count: 8
Related papers
21 items in total
[1] Clarke C.L.A., Cormack G.V., Tudhope E.A., Relevance ranking for one to three term queries, In: Proc. RIAO '97, 1997, pp. 388-400
[2] Cooper W.S., Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval, ACM Trans. Information Systems, 13, pp. 100-111, (1995)
[3] Croft W.B., Turtle H.R., Text retrieval and inference, Text-based Intelligent Systems. Lawrence Erlbaum, pp. 127-156, (1992)
[4] Hawking D., Thistlewaite P., Relevance Weighting Using Distance Between Term Occurrences, (1996)
[5] Hiemstra D., A linguistically motivated probabilistic model of information retrieval, Proc. 2nd European Conference On Research and Advanced Technology For Digital Libraries (ECDL-2), pp. 569-584, (1998)
[6] Hiemstra D., de Jong F.M.G., Cross-language retrieval in Twenty-One: Using one, some or all possible translations?, In: Proc. 14th Twente Workshop On Language Technology (TWLT-14), 1998, pp. 19-26
[7] Hiemstra D., Kraaij W., Twenty-One at TREC-7: Ad-hoc and cross-language track, Proc. 7th Text Retrieval Conference (TREC-7). NIST Special Publications, (1999)
[8] Manning C., Schütze H., Statistical Natural Language Processing: Theory and Practice (draft), (1998)
[9] Miller D.R.H., Leek T., Schwartz R.M., BBN at TREC-7: Using Hidden Markov Models for information retrieval, Proc. 7th Text Retrieval Conference, TREC-7. NIST Special Publications, (1999)
[10] Mood A.M., Graybill F.A., Introduction to The Theory of Statistics, (1963)