An algorithm to cluster documents based on relevance

Cited by: 6
Authors
Desai, M
Spink, A
Affiliations
[1] Univ Pittsburgh, Sch Informat Sci, Pittsburgh, PA USA
[2] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
DOI
10.1016/j.ipm.2004.05.003
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Search engines fail to make a clear distinction between items of varying relevance when presenting search results to users. Instead, they rely on the user of the system to estimate which items are relevant, partially relevant, or not relevant. The user of the system is given the task of distinguishing between documents that are relevant to different degrees. This process often hinders the accessibility of relevant or partially relevant documents, particularly when the results set is large and documents of varying relevance are scattered throughout the set. In this paper, we present a clustering scheme that groups documents within relevant, partially relevant, and not relevant regions for a given search. A clustering algorithm accomplishes the task of clustering documents based on relevance. The clusters were evaluated by end users issuing categorical, interval, and descriptive relevance judgments for the documents returned from a search. The degree of overlap between users and the system for each of the clustered regions was measured to determine the overall effectiveness of the algorithm. This research showed that clustering documents on the Web by regions of relevance is highly necessary and quite feasible. (c) 2004 Elsevier Ltd. All rights reserved.
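The abstract does not describe the clustering algorithm or the overlap measure in detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each document already has a relevance score in [0, 1], uses two hypothetical cut-offs (0.66 and 0.33) to form the three regions, and treats "overlap" as the fraction of documents in each system region that a user placed in the same region. All names, scores, and labels are illustrative.

```python
REGIONS = ("relevant", "partially relevant", "not relevant")


def assign_region(score, hi=0.66, lo=0.33):
    """Map a score in [0, 1] to a region. The cut-offs are hypothetical."""
    if score >= hi:
        return "relevant"
    if score >= lo:
        return "partially relevant"
    return "not relevant"


def cluster_by_region(scores):
    """Group document ids into the three relevance regions."""
    clusters = {region: [] for region in REGIONS}
    for doc_id, score in scores.items():
        clusters[assign_region(score)].append(doc_id)
    return clusters


def overlap_by_region(system_clusters, user_labels):
    """Per region, the share of system-assigned documents the user agreed with.

    Returns None for a region that the system left empty.
    """
    result = {}
    for region, docs in system_clusters.items():
        if not docs:
            result[region] = None
            continue
        agree = sum(1 for d in docs if user_labels.get(d) == region)
        result[region] = agree / len(docs)
    return result


# Illustrative data only: made-up scores and one user's categorical judgments.
scores = {"d1": 0.9, "d2": 0.7, "d3": 0.5, "d4": 0.2, "d5": 0.1}
user_labels = {
    "d1": "relevant",
    "d2": "partially relevant",
    "d3": "partially relevant",
    "d4": "not relevant",
    "d5": "not relevant",
}

clusters = cluster_by_region(scores)
overlap = overlap_by_region(clusters, user_labels)
print(clusters)
print(overlap)
```

In this toy run the user disagrees with the system on one of the two documents placed in the relevant region, so that region scores lower than the other two. A real evaluation would aggregate over many users and queries.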
Pages: 1035-1049
Number of pages: 15