An information-theoretic perspective of tf-idf measures

Cited by: 755
Authors
Aizawa, A [1]
Affiliations
[1] Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Keywords
tf-idf; term weighting theories; information theory; text categorization;
DOI
10.1016/S0306-4573(02)00021-3
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency-inverse document frequency measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation. (C) 2002 Elsevier Science Ltd. All rights reserved.
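The abstract describes the PWI as a product of a term's occurrence probability and its amount of information. The sketch below illustrates that product form only; it is not the paper's definition. It assumes, purely for illustration, that the occurrence probability is estimated as the term's count in a document divided by the total token count of the collection, and that the amount of information is log(N / df), the familiar idf quantity. The function name `pwi_like_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def pwi_like_weights(docs):
    """Return {(doc_index, term): weight} for a list of token lists.

    Assumed form (illustrative, not taken from the paper):
        weight = P(term, doc) * information(term)
        P(term, doc)      = count of term in doc / total tokens in collection
        information(term) = log(number of docs / docs containing term)
    """
    n_docs = len(docs)
    total_tokens = sum(len(doc) for doc in docs)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = {}
    for j, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            probability = tf / total_tokens
            information = math.log(n_docs / doc_freq[term])
            weights[(j, term)] = probability * information
    return weights


docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "apple", "date"],
]
weights = pwi_like_weights(docs)
```

Under these assumptions, a term that occurs in every document (here `banana`) carries zero information and therefore receives zero weight, which mirrors the behaviour of conventional tf-idf weighting.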
Pages: 45-65
Number of pages: 21