A survey on session detection methods in query logs and a proposal for future evaluation

Cited by: 51
Author
Gayo-Avello, Daniel [1 ]
Affiliation
[1] Univ Oviedo, Dept Comp Sci, Oviedo 33007, Spain
Keywords
Web searching; Search engine; Query log; Topical session; Session detection; NEURAL-NETWORK APPLICATIONS; TOPIC IDENTIFICATION; WEB; INFORMATION; MULTITASKING; RELEVANCE;
DOI
10.1016/j.ins.2009.01.026
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Search engine logs provide highly detailed insight into users' interactions. Hence, they are both extremely useful and sensitive. The datasets publicly available to scholars are, unfortunately, too few, too dated and too small. They are few because search engine companies are reluctant to release such data; they are dated because they were collected in the late 1990s or early 2000s; and they are small because they comprise data for at most one day and just a few hundred thousand users. Even worse, the large query log disclosed by AOL in 2006 caused more harm than good because of a major privacy flaw. In this paper the author provides an overview of the possible applications of query logs, the privacy concerns researchers must face when working on such datasets, and several ways in which query logs can be easily sanitized. One such measure consists of segmenting the logs into short topical sessions. Therefore, the author offers a comprehensive survey of session detection methods, as well as a thorough description of a new evaluation framework with performance results for each of the different methods. Additionally, a new, simple, but better-performing session detection method is proposed. It is a heuristic-based technique that uses a geometric interpretation of both the time gap between queries and the similarity between them to flag a topic shift. (c) 2009 Elsevier Inc. All rights reserved.
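The abstract does not give the formulas behind the proposed heuristic, so the following is only an illustrative sketch of the general idea, not the author's method. It assumes that each pair of consecutive queries is mapped to a point whose coordinates are a temporal closeness and a lexical similarity, both in [0, 1], and that a topic shift is flagged when that point falls inside a circle around the origin. The character-trigram Jaccard similarity, the 30-minute `max_gap_seconds`, the `radius` of 1.0, and all function names are assumptions introduced here for illustration.

```python
import math
from datetime import datetime, timedelta


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a space-padded, lowercased query."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard_similarity(a, b):
    """Jaccard overlap of character trigrams (an assumed similarity measure)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_topic_shift(prev_query, prev_time, query, time,
                   max_gap_seconds=1800.0, radius=1.0):
    """Flag a shift when the (closeness, similarity) point lies inside the circle.

    max_gap_seconds and radius are hypothetical parameters, not values
    taken from the paper.
    """
    gap = (time - prev_time).total_seconds()
    closeness = max(0.0, 1.0 - gap / max_gap_seconds)
    similarity = jaccard_similarity(prev_query, query)
    return math.hypot(closeness, similarity) < radius


def segment(log):
    """Split a time-ordered list of (timestamp, query) pairs into sessions."""
    sessions = []
    for time, query in log:
        if sessions:
            prev_time, prev_query = sessions[-1][-1]
            if not is_topic_shift(prev_query, prev_time, query, time):
                sessions[-1].append((time, query))
                continue
        sessions.append([(time, query)])
    return sessions


start = datetime(2006, 3, 1, 9, 0, 0)
example_log = [
    (start, "cheap flights madrid"),
    (start + timedelta(seconds=30), "cheap flights madrid spain"),
    (start + timedelta(hours=3), "python list sort"),
]
example_sessions = segment(example_log)
```

In this sketch, two queries issued seconds apart stay in the same session even when they share few trigrams, while queries separated by a long gap stay together only if they are nearly identical. Other decision boundaries would encode that trade-off differently.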
Pages: 1822-1843
Page count: 22