Interpreting TF-IDF term weights as making relevance decisions

Cited by: 472
Authors
Wu, Ho Chung [1]
Luk, Robert Wing Pong [1]
Wong, Kam Fai [2]
Kwok, Kui Lam [3]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China
[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China
[3] CUNY, Queens Coll, Dept Comp Sci, Flushing, NY 11367 USA
Keywords
design; experimentation; languages; performance; information retrieval; term weight; relevance decision
DOI
10.1145/1361684.1361686
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these "local" relevance decisions as the "document-wide" relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity -log p(r̄ | t ∈ d) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r̄) of term t. Mathematically, we show that this quantity can be approximated by the inverse document-frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.
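To make the shape of a TF-IDF ranking formula concrete, the following is a minimal sketch of textbook TF-IDF scoring, not the authors' probabilistic model. It assumes a raw term count as the term-frequency factor and -log(df / N) as the IDF factor, where df is the number of documents containing the term and N is the collection size. The function names, the toy collection, and the query are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """Textbook IDF, written as -log(df / N) to mirror the -log p(...) form."""
    df = sum(1 for d in docs if term in d)  # documents containing the term
    if df == 0:
        return 0.0  # assumption: a term absent from the collection scores zero
    return -math.log(df / len(docs))


def tfidf_score(query, doc, docs):
    """Sum over query terms of (raw term count in doc) * IDF."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in query)


# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["term", "weight", "retrieval"],
    ["relevance", "decision", "retrieval"],
    ["term", "term", "relevance"],
]
query = ["term", "relevance"]

scores = [tfidf_score(query, d, docs) for d in docs]
# Document indices ordered by descending score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this toy collection the third document ranks first because it contains both query terms and repeats one of them. Retrieval systems typically replace the raw count with a dampened or length-normalized term-frequency factor, which is the part the abstract says can be rendered into different forms.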
Pages: 37