A semisupervised learning method to merge search engine results

Cited by: 66
Authors
Si, L [1]
Callan, J [1]
Affiliations
[1] Carnegie Mellon Univ, Sch Comp Sci, Language Technol Inst, Pittsburgh, PA 15213 USA
Keywords
algorithm; design; experimentation; distributed information retrieval; semisupervised learning method; resource ranking; resource selection; server selection; results merging
DOI
10.1145/944012.944017
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The proliferation of searchable text databases on local area networks and the Internet causes the problem of finding information that may be distributed among many disjoint text databases (distributed information retrieval). How to merge the results returned by selected databases is an important subproblem of the distributed information retrieval task. Previous research assumed that either resource providers cooperate to provide normalizing statistics or search clients download all retrieved documents and compute normalized scores without cooperation from resource providers. This article presents a semisupervised learning solution to the result merging problem. The key contribution is the observation that information used to create resource descriptions for resource selection can also be used to create a centralized sample database to guide the normalization of document scores returned by different databases. At retrieval time, the query is sent to the selected databases, which return database-specific document scores, and to a centralized sample database, which returns database-independent document scores. Documents that have both a database-specific score and a database-independent score serve as training data for learning to normalize the scores of other documents. An extensive set of experiments demonstrates that this method is more effective than the well-known CORI result-merging algorithm under a variety of conditions.
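The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not state what form the learned normalization takes, so the per-database linear mapping fitted by least squares below is an assumption made for illustration, and the names `fit_linear` and `merge_results`, along with all scores, are hypothetical.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Least-squares fit of y = a * x + b (assumed form of the normalizer)."""
    if len(xs) < 2:
        raise ValueError("need at least two overlap documents to fit a line")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("database-specific scores of overlap documents are all equal")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def merge_results(db_results, sample_scores):
    """Merge per-database rankings into one list.

    db_results:    {database name: {doc id: database-specific score}}
    sample_scores: {doc id: database-independent score from the sample database}
    """
    merged = []
    for db, scored in db_results.items():
        # Documents with both kinds of score act as training data.
        overlap = [d for d in scored if d in sample_scores]
        a, b = fit_linear([scored[d] for d in overlap],
                          [sample_scores[d] for d in overlap])
        # Apply the learned mapping to every document from this database.
        merged.extend((a * s + b, db, d) for d, s in scored.items())
    merged.sort(key=lambda t: t[0], reverse=True)
    return merged


# Made-up scores: databases A and B score on incompatible scales.
db_results = {
    "A": {"a1": 0.9, "a2": 0.5, "a3": 0.1},
    "B": {"b1": 40.0, "b2": 25.0, "b3": 10.0},
}
sample_scores = {"a1": 0.8, "a3": 0.0, "b1": 0.6, "b3": 0.3}

ranking = merge_results(db_results, sample_scores)
for score, db, doc in ranking:
    print(f"{doc} ({db}): {score:.2f}")
```

In this toy run, documents a2 and b2 never appear in the sample database, yet they receive comparable scores through the mapping learned from their databases' overlap documents. A real system would also need a fallback for databases with too few overlap documents, which the sketch handles only by raising an error.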
Pages: 457-491
Page count: 35