Probe, cluster, and discover: Focused extraction of QA-Pagelets from the Deep Web

Cited by: 12
Authors
Caverlee, J [1]
Liu, L [1]
Buttler, D [1]
Affiliations
[1] Georgia Inst Technol, Coll Comp, Atlanta, GA 30332 USA
Source
20th International Conference on Data Engineering, Proceedings | 2004
Keywords
DOI
10.1109/ICDE.2004.1319988
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a Deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
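The first phase described in the abstract (grouping pages into clusters of structurally similar pages) can be illustrated with a minimal sketch. This is not THOR's actual algorithm, which the abstract does not specify. It assumes, for illustration only, that each page is summarized by its HTML tag counts, that similarity is the cosine between those count vectors, and that clusters are formed greedily. The names `tag_signature`, `cosine`, `cluster_pages` and the `threshold` value are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt


class TagCounter(HTMLParser):
    """Counts how often each HTML start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_signature(html):
    """Assumed structural summary of a page: a bag of tag names."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts


def cosine(a, b):
    """Cosine similarity between two tag-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.9):
    """Greedy single-pass clustering (an assumption, not the paper's method).

    A page joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of page indices.
    """
    clusters = []  # pairs of (representative signature, member indices)
    for i, html in enumerate(pages):
        sig = tag_signature(html)
        for rep, members in clusters:
            if cosine(sig, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return [members for _, members in clusters]


# Two result pages with a table of matches, and one "no matches" page.
pages = [
    "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>",
    "<html><body><table><tr><td>c</td></tr><tr><td>d</td></tr><tr><td>e</td></tr></table></body></html>",
    "<html><body><p>No matches found.</p></body></html>",
]
result = cluster_pages(pages)
```

In this toy input the two table-based result pages land in one cluster and the "no matches" page in another. A second phase would then compare subtrees across the pages within a cluster to locate the region that varies with the query.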
Pages: 103-114
Number of pages: 12