HW-STALKER: A machine learning-based system for transforming QURE-Pagelets to XML

Cited by: 4
Authors
Kovalev, V
Bhowmick, SS [1]
Madria, S
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Div Informat Syst, Singapore 639798, Singapore
[2] Univ Missouri, Dept Comp Sci, Rolla, MO 65409 USA
Keywords
hidden web; dynamic content; identifiers; facilitators; STALKER; XML; QURE-Pagelets
DOI
10.1016/j.datak.2005.01.001
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we address the problem of extracting and transforming dynamically generated hyperlinked hidden web query results to XML. Our approach is based on the STALKER approach. As STALKER was designed to extract data from a single web page, it cannot handle a set of hyperlinked pages. We propose an algorithm called HW-Transform for transforming hidden web query results (also called QURE-Pagelets) to XML format using machine learning by extending STALKER to handle hyperlinked hidden web pages. One of the key features of our approach is that we identify and transform key attributes of query results into XML attributes. These key attributes facilitate applications such as change detection and data integration by efficiently identifying related or identical results. Based on the proposed algorithm, we have implemented a prototype system called HW-STALKER using Java. Our experiments demonstrate that HW-Transform shows acceptable performance for transforming QURE-Pagelets to XML. (c) 2005 Elsevier B.V. All rights reserved.
Pages: 241-276
Page count: 36