Two-stage statistical language models for text database selection

Cited by: 6
Authors
Yang, H [1 ]
Zhang, MJ [1 ]
Affiliations
[1] Univ Wollongong, Sch Informat Technol & Comp Sci, Wollongong, NSW 2500, Australia
Source
INFORMATION RETRIEVAL | 2006, Vol. 9, No. 1
Keywords
database language model; text database selection; distributed information retrieval; hierarchical topics; statistical language modeling; query expansion;
DOI
10.1007/s10791-005-5719-z
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
As the number and diversity of distributed Web databases on the Internet increase exponentially, it is difficult for users to know which databases are appropriate to search. Given database language models that describe the content of each database, database selection services can help users locate databases relevant to their information needs. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea behind the approach is that, for databases that are categorized into a topic hierarchy, individual language models are estimated at different search stages, and the databases are then ranked by their similarity to the query according to the estimated language models. Two-stage smoothed language models are presented to circumvent inaccuracy due to word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art database selection approaches.
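To make the ranking idea concrete, here is a minimal sketch of database selection by smoothed query likelihood. It is not the estimator from the paper: it assumes plain unigram models and a single Jelinek-Mercer interpolation with a background model pooled over all databases, and it leaves out the topic hierarchy and the second smoothing stage. The names `unigram_model`, `score`, `rank_databases`, the weight `lam`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def score(query, db_model, background, lam=0.5):
    """Log-likelihood of the query under a smoothed database model.

    Assumption: Jelinek-Mercer smoothing, i.e. a linear mix of the
    database model and a background model with weight `lam`.
    """
    total = 0.0
    for w in query:
        p = (1 - lam) * db_model.get(w, 0.0) + lam * background.get(w, 0.0)
        if p == 0.0:
            # The word is unseen in every database, so no smoothing mass exists.
            return float("-inf")
        total += math.log(p)
    return total


def rank_databases(query, databases, lam=0.5):
    """Return (name, score) pairs sorted from most to least likely."""
    pooled = [w for tokens in databases.values() for w in tokens]
    background = unigram_model(pooled)
    models = {name: unigram_model(tokens) for name, tokens in databases.items()}
    scored = [(name, score(query, models[name], background, lam)) for name in models]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): each database is represented by a bag of words.
databases = {
    "medicine": "heart disease treatment heart surgery".split(),
    "sports": "football match score football league".split(),
}
ranking = rank_databases(["heart", "surgery"], databases)
```

Without the background term, the "sports" database would assign zero probability to both query words. Smoothing keeps its score finite, which is the word-sparseness problem the abstract refers to.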
Pages: 5-31
Page count: 27