Extracting Web data using instance-based learning

Cited by: 21
Authors
Zhai, Yanhong [1]
Liu, Bing [1]
Affiliation
[1] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2007 / Vol. 10 / No. 2
Keywords
web content mining; web data extraction; instance-based learning;
DOI
10.1007/s11280-007-0022-0
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper studies structured data extraction from Web pages. Existing approaches to data extraction include wrapper induction and automated methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance to be extracted with labeled instances. The key advantage of our method is that it does not require an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance. Only when a new instance cannot be extracted does it need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled instances may not be representative of all other instances. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates. Pages of the same template usually can be extracted based on a single page instance of the template. A novel technique is proposed to match a new instance with a manually labeled instance and in the process to extract the required data items from the new instance. The technique is also very efficient. Experimental results based on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method. It also outperforms the state-of-the-art existing systems significantly.
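The extraction loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' matching technique, which the abstract does not spell out; it swaps in a much simpler heuristic that locates each item in a new page by the tokens that surrounded it in a labeled page. All names here (`LabeledInstance`, `find_sub`, `extract`, `extract_with_memory`) and the `context` parameter are hypothetical, and the sketch assumes every labeled item has at least one token of context on each side.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LabeledInstance:
    """One manually labeled page (hypothetical structure, not from the paper)."""
    tokens: List[str]
    items: Dict[str, range]  # item name -> token positions of its value


def find_sub(tokens: List[str], pattern: List[str], start: int = 0) -> Optional[int]:
    """Index of the first occurrence of pattern at or after start, else None."""
    if not pattern:  # assumption: an item without context cannot be matched
        return None
    for i in range(start, len(tokens) - len(pattern) + 1):
        if tokens[i:i + len(pattern)] == pattern:
            return i
    return None


def extract(new_tokens: List[str], labeled: LabeledInstance,
            context: int = 2) -> Optional[Dict[str, str]]:
    """Extract every labeled item from new_tokens, or None if any item fails."""
    result = {}
    for name, span in labeled.items.items():
        # Tokens immediately before and after the item in the labeled page.
        prefix = labeled.tokens[max(0, span.start - context):span.start]
        suffix = labeled.tokens[span.stop:span.stop + context]
        p = find_sub(new_tokens, prefix)
        if p is None:
            return None
        begin = p + len(prefix)
        end = find_sub(new_tokens, suffix, begin)
        if end is None:
            return None
        result[name] = " ".join(new_tokens[begin:end])
    return result


def extract_with_memory(new_tokens: List[str],
                        memory: List[LabeledInstance]) -> Optional[Dict[str, str]]:
    """Try each stored labeled instance; None means the page needs labeling."""
    for instance in memory:
        result = extract(new_tokens, instance)
        if result is not None:
            return result
    return None


# Start from a single labeled instance, as the abstract describes.
memory = [LabeledInstance(
    tokens=["<b>", "Name:", "</b>", "Canon", "A80", "<br>",
            "<b>", "Price:", "</b>", "$299", "<br>"],
    items={"name": range(3, 5), "price": range(9, 10)},
)]
same_template = ["<b>", "Name:", "</b>", "Nikon", "Coolpix", "4300", "<br>",
                 "<b>", "Price:", "</b>", "$349", "<br>"]
other_template = ["<td>", "Nikon", "</td>", "<td>", "$349", "</td>"]

print(extract_with_memory(same_template, memory))
print(extract_with_memory(other_template, memory))
```

In this sketch a `None` result plays the role of "the new instance cannot be extracted": the caller would then label that page and append it to `memory`, so labeling effort is spent only on templates that have not been seen before.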
Pages: 113-132
Number of pages: 20