Probe, cluster, and discover: Focused extraction of QA-Pagelets from the Deep Web

Cited by: 12
Authors
Caverlee, J [1]
Liu, L [1]
Buttler, D [1]
Affiliation
[1] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
Source
20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2004
DOI
10.1109/ICDE.2004.1319988
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
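The two-phase idea in the abstract can be illustrated with a toy sketch. This is not THOR's actual algorithm, which the abstract does not specify in detail. It assumes tag-count cosine similarity with a greedy threshold for the clustering phase, and it flags tag paths whose text differs across pages of one cluster as candidate query-answer regions. All names (`PathCollector`, `cluster_pages`, `dynamic_paths`), the `0.9` threshold, and the sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt


class PathCollector(HTMLParser):
    """Records tag counts and the text found under each tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.tag_counts = Counter()
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.tag_counts[tag] += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.text_by_path[path] = self.text_by_path.get(path, "") + " " + text


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.tag_counts, collector.text_by_path


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.9):
    """Phase 1 (toy): greedy clustering on tag-count similarity (an assumption)."""
    signatures = [parse(p)[0] for p in pages]
    clusters = []  # each cluster is a list of page indices
    for i, sig in enumerate(signatures):
        for members in clusters:
            if cosine(sig, signatures[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def dynamic_paths(pages):
    """Phase 2 (toy): tag paths present in every page whose text varies between pages."""
    texts = [parse(p)[1] for p in pages]
    shared = set(texts[0]).intersection(*texts[1:])
    return sorted(p for p in shared if len({t[p] for t in texts}) > 1)


# Hypothetical sample pages: two result pages and one "no matches" page.
pages = [
    "<html><body><h1>Shop</h1><ul><li>red bike</li><li>blue bike</li></ul></body></html>",
    "<html><body><h1>Shop</h1><ul><li>oak desk</li><li>pine desk</li></ul></body></html>",
    "<html><body><h1>Shop</h1><p>No results found</p></body></html>",
]
clusters = cluster_pages(pages)
candidates = dynamic_paths([pages[i] for i in clusters[0]])
```

On these sample pages, the two result pages fall into one cluster and the "no matches" page into another. Within the first cluster, the list-item path is flagged because its text changes from page to page, while the static heading is not.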
Pages: 103-114
Page count: 12