Document length normalization

Cited by: 80
Authors
Singhal, A
Salton, G
Mitra, M
Buckley, C
Affiliation
[1] Department of Computer Science, 4130 Upson Hall, Cornell University, Ithaca
Funding
U.S. National Science Foundation
DOI
10.1016/0306-4573(96)00008-8
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the TREC collection, a large full-text experimental text collection with widely varying document lengths, we observe that the likelihood of a document being judged relevant by a user increases with the document length. We show that a retrieval strategy, such as the vector-space cosine match, that retrieves documents of different lengths with roughly equal chances, will not optimally retrieve useful documents from such a collection. We present a modified technique, pivoted cosine normalization, that attempts to match the likelihood of retrieving documents of all lengths to the likelihood of their relevance, and show that this technique yields significant improvements in retrieval effectiveness. Copyright (C) 1996 Elsevier Science Ltd
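
The abstract names pivoted cosine normalization without giving its formula. The sketch below is a minimal illustration, assuming the commonly cited form of pivoted normalization, in which the old cosine factor is tilted around a pivot value: (1 - slope) * pivot + slope * old_norm. The function names, the toy weights, and the slope value of 0.75 are hypothetical choices for illustration and are not taken from this record.

```python
import math


def cosine_norm(weights):
    """Euclidean length of a document's term-weight vector (the usual cosine factor)."""
    return math.sqrt(sum(w * w for w in weights))


def pivoted_norm(old_norm, pivot, slope):
    """Assumed pivoted factor: tilt the old normalization around the pivot.

    With slope < 1, documents whose old factor exceeds the pivot are
    penalized less than under plain cosine normalization, and documents
    below the pivot are penalized more.
    """
    return (1.0 - slope) * pivot + slope * old_norm


def score(query_weights, doc_weights, pivot, slope):
    """Inner product over shared terms, divided by the pivoted factor."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    norm = pivoted_norm(cosine_norm(doc_weights.values()), pivot, slope)
    return dot / norm


if __name__ == "__main__":
    # Toy collection; all weights are made up for illustration.
    docs = {
        "short": {"retrieval": 1.0, "length": 1.0},
        "long": {"retrieval": 2.0, "length": 2.0, "trec": 2.0, "cosine": 2.0},
    }
    query = {"retrieval": 1.0, "length": 1.0}
    # One common choice is to set the pivot to the average old factor.
    pivot = sum(cosine_norm(d.values()) for d in docs.values()) / len(docs)
    for name, doc in docs.items():
        print(name, score(query, doc, pivot, slope=0.75))
```

Setting slope to 1.0 recovers plain cosine normalization, which makes it easy to compare the two rankings on the same collection.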
Pages: 619-633
Number of pages: 15