Hierarchical wrapper induction for semistructured information sources

Cited by: 145
Authors
Muslea, I
Minton, S
Knoblock, CA
Affiliations
[1] University of Southern California, Information Sciences Institute, Marina del Rey, CA 90292, USA
[2] University of Southern California, Integrated Media Systems Center, Marina del Rey, CA 90292, USA
Funding
U.S. National Science Foundation
Keywords
wrapper induction; information extraction; supervised inductive learning; information agents
DOI
10.1023/A:1010022931168
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high-accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
Pages: 93-114
Page count: 22