Scaling Up the TREC Collection

Cited by: 8
Authors
David Hawking
Paul Thistlewaite
Donna Harman
Source
Information Retrieval | 1999 / Vol. 1 / Issue 1-2
Keywords
test collection; very large databases; text retrieval
DOI
10.1023/A:1009938405269
Abstract
Due to the popularity of Web search engines, a large proportion of real text retrieval queries are now processed over collections measured in tens or hundreds of gigabytes. A new Very Large test Collection (VLC) has been created to support qualification, measurement and comparison of systems operating at this level and to permit the study of the properties of very large collections. The VLC is an extension of the well-known TREC collection and has been distributed under the same conditions. A simple set of efficiency and effectiveness measures has been defined to encourage comparability of reporting. The 20-gigabyte first edition of the VLC and a representative 10% sample have been used in a special interest track of the 1997 Text Retrieval Conference (TREC-6). The unaffordable cost of obtaining complete relevance assessments over collections of this scale is avoided by concentrating on early precision and relying on the core TREC collection to support detailed effectiveness studies. Results obtained by TREC-6 VLC track participants are presented here. All groups observed a significant increase in early precision as collection size increased. Explanatory hypotheses are advanced for future empirical testing. A 100-gigabyte second edition of the VLC (VLC2) has recently been compiled and distributed for use in TREC-7 in 1998.
Pages: 115-137
Number of pages: 22