Engineering a multi-purpose test collection for Web retrieval experiments

Cited by: 62
Authors
Bailey, P [1]
Craswell, N
Hawking, D
Affiliations
[1] Australian Natl Univ, Dept Comp Sci, Canberra, ACT 0200, Australia
[2] CSIRO, Math & Informat Sci, Canberra, ACT 2601, Australia
Keywords
Web retrieval; link-based ranking; distributed information retrieval; test collections
DOI
10.1016/S0306-4573(02)00084-5
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments which are both realistic and reproducible. The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text. WT10g was used in TREC-9 and TREC-2000 and both topic relevance and homepage finding queries and judgments are available. © 2003 Elsevier Ltd. All rights reserved.
Pages: 853-871
Number of pages: 19