Efficient Selection and Integration of Hidden Web Database

Cited by: 5
Authors
Xian, Xuefeng [1 ,2 ]
Zhao, Pengpeng [1 ,2 ]
Yang, Yuanfeng [1 ,2 ]
Xin, Jie [2 ]
Cui, Zhiming [1 ,2 ]
Affiliations
[1] JiangSu Prov Support Software Engn R&D Ctr Modern, Suzhou, Peoples R China
[2] Soochow Univ, Inst Intelligent Informat Proc & Applicat, Suzhou, Peoples R China
Keywords
hidden web; data integration; web database selection
DOI
10.4304/jcp.5.4.500-507
Chinese Library Classification number
TP39 [Computer applications]
Subject classification code
081203; 0835
Abstract
An ever-increasing amount of valuable information is stored in web databases, "hidden" behind search interfaces, and a new application area is emerging for information retrieval and integration. There may be hundreds or thousands of web databases providing data relevant to a particular domain. A primary challenge for internet-scale hidden web database integration is therefore to determine which web databases to include in the integration system, with the aim of making the system contain as much high-quality data as possible with the least degree of overlap. In this paper, we present an approach that iteratively selects and integrates candidate web databases. The core of this approach is a benefit function that evaluates how much benefit a web database brings to a given state of an integration system when it is integrated. We devise a benefit function based on the volume and quality of the new data that integrating the web database adds to the integration system. We show in practice how to apply our approach efficiently to select and integrate web databases. Our experiments on real hidden web databases indicate that the set of web databases selected and integrated by our approach yields an integration system with significantly higher utility than a wide range of other strategies.
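The iterative selection described in the abstract can be illustrated with a minimal greedy sketch. This is not the paper's benefit function, whose exact form is not given in this record. It assumes, purely for illustration, that each candidate database is a mapping from record IDs to a quality score in [0, 1], and that benefit is the total quality of the records not yet in the integration system. The names `Database`, `benefit`, and `greedy_select` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: each candidate database maps record IDs to a
# quality score in [0, 1]. The paper's actual data model is not given here.
Database = Dict[str, float]


def benefit(db: Database, integrated: Set[str]) -> float:
    """Assumed benefit: total quality of records not yet in the system."""
    return sum(q for rec, q in db.items() if rec not in integrated)


def greedy_select(
    candidates: Dict[str, Database], k: int
) -> Tuple[List[str], Set[str]]:
    """Iteratively pick up to k databases, each time taking the one whose
    benefit relative to the current integrated state is largest."""
    integrated: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda name: benefit(remaining[name], integrated))
        if benefit(remaining[best], integrated) <= 0:
            break  # no remaining database adds new data
        chosen.append(best)
        integrated.update(remaining.pop(best))  # adds the record IDs (dict keys)
    return chosen, integrated


# Toy data: B overlaps heavily with A, so C is preferred in the second round.
candidates = {
    "A": {"r1": 0.9, "r2": 0.8, "r3": 0.7},
    "B": {"r2": 0.8, "r3": 0.7, "r4": 0.4},
    "C": {"r5": 0.9, "r6": 0.9},
}
chosen, integrated = greedy_select(candidates, k=2)
print(chosen)
```

Because benefit is recomputed against the current state in every round, a large database that mostly duplicates already-integrated data scores low, which reflects the stated goal of high-quality data with little overlap. The contents of a hidden web database cannot be enumerated directly, so the record sets here stand in for whatever estimate of its contents is available.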
Pages: 500-507
Page count: 8