Automatic information extraction from semi-structured Web pages by pattern discovery

Cited by: 46
Authors
Chang, CH [1]
Hsu, CN
Lui, SC
Affiliations
[1] Natl Cent Univ, Dept Comp Sci & Informat Engn, Chungli 320, Tauyuan, Taiwan
[2] Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan
[3] ChungHwa Telecommun Labs, Yangmei 326, Tauyuan, Taiwan
Keywords
information extraction; semi-structured data; wrapper generation; PAT trees; multiple string alignment
DOI
10.1016/S0167-9236(02)00100-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of information extractors that can extract structured data from semi-structured Web documents. Previous work in wrapper induction aims at learning extraction rules from user-labeled training examples, which, however, can be expensive in some practical applications. In this paper, we introduce IEPAD (an acronym for Information Extraction based on PAttern Discovery), a system that discovers extraction patterns from Web pages without user-labeled examples. IEPAD applies several pattern discovery techniques, including PAT-trees, multiple string alignments and pattern matching algorithms. Extractors generated by IEPAD can be generalized over unseen pages from the same Web data source. We empirically evaluate the performance of IEPAD on an information extraction task from 14 real Web data sources. Experimental results show that with the extraction rules discovered from a single page, IEPAD achieves 96% average retrieval rate, and with less than five example pages, IEPAD achieves 100% retrieval rate for 10 of the sample Web data sources. (C) 2002 Elsevier Science B.V. All rights reserved.
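The core idea in the abstract, finding the repeated markup pattern that delimits records without any labeled examples, can be illustrated with a small sketch. This is not IEPAD's implementation: a naive enumeration of token n-grams stands in for the PAT-tree, the multiple string alignment step is omitted, and the names `tokenize`, `non_overlapping`, and `best_record_pattern` as well as the coverage score are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def tokenize(html):
    """Abstract a page into tokens: one per tag name, 'TEXT' per text run."""
    tokens = []
    for name, text in re.findall(r"<(/?\w+)[^>]*>|([^<]+)", html):
        if name:
            tokens.append(f"<{name.lower()}>")  # drop attributes, keep tag name
        elif text.strip():
            tokens.append("TEXT")               # ignore whitespace-only runs
    return tokens


def non_overlapping(starts, length):
    """Greedily keep occurrences that do not overlap the previous kept one."""
    kept, next_free = [], 0
    for s in starts:
        if s >= next_free:
            kept.append(s)
            next_free = s + length
    return kept


def best_record_pattern(tokens, min_len=2):
    """Return (pattern, starts) for the repeated token sequence that covers
    the most tokens with non-overlapping occurrences (an assumed heuristic)."""
    best, best_cover = (None, []), 0
    n = len(tokens)
    for length in range(min_len, n // 2 + 1):
        positions = defaultdict(list)
        for i in range(n - length + 1):
            positions[tuple(tokens[i:i + length])].append(i)
        for pattern, starts in positions.items():
            kept = non_overlapping(starts, length)
            cover = length * len(kept)
            if len(kept) >= 2 and cover > best_cover:
                best, best_cover = (pattern, kept), cover
    return best


page = ("<ul><li><b>A</b>one</li><li><b>B</b>two</li>"
        "<li><b>C</b>three</li></ul>")
pattern, starts = best_record_pattern(tokenize(page))
print(pattern, starts)
```

Each start position marks one candidate record, so the discovered pattern can be reused as an extraction rule on other pages from the same source. The quadratic enumeration here is only practical for short pages; a suffix-based index such as the PAT-tree named in the abstract is the kind of structure that makes this search efficient.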
Pages: 129-147
Page count: 19