Wrapper induction: Efficiency and expressiveness

被引:231
作者
Kushmerick, N [1 ]
机构
[1] Natl Univ Ireland Univ Coll Dublin, Dept Comp Sci, Dublin 4, Ireland
基金
美国国家科学基金会;
关键词
information extraction; wrapper induction; machine learning; internet information integration; information agents;
D O I
10.1016/S0004-3702(99)00100-9
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The Internet presents numerous sources of useful information-telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then turn to efficiency: we measure the number of examples and time required to learn wrappers in each class, and we compare these results to PAC models of our task and asymptotic complexity analyses of our algorithms. Summarizing our results, we find that most of our wrapper classes are reasonably useful (70% of surveyed sites can be handled in total), yet can rapidly learned (learning usually requires just a handful of examples and a fraction of a CPU second per example). (C) 2000 Elsevier Science B.V. All rights reserved.
引用
收藏
页码:15 / 68
页数:54
相关论文
共 75 条
[1]  
ABITEBOUL S, 1996, P INT C DAT THEOR, P1
[2]  
ADALI S, 1996, P ACM SIGMOD C MAN D
[3]  
ADELBERG B, 1998, P ACM SIGMOD C MAN D
[4]   The constraint-based knowledge broker model: Semantics, implementation and analysis [J].
Andreoli, JM ;
Borghoff, UM ;
Pareschi, R .
JOURNAL OF SYMBOLIC COMPUTATION, 1996, 21 (4-6) :635-667
[5]  
Angluin D., 1992, Proceedings of the Twenty-Fourth Annual ACM Symposium on the Theory of Computing, P351, DOI 10.1145/129712.129746
[6]   INFERENCE OF REVERSIBLE LANGUAGES [J].
ANGLUIN, D .
JOURNAL OF THE ACM, 1982, 29 (03) :741-765
[7]   LEARNING REGULAR SETS FROM QUERIES AND COUNTEREXAMPLES [J].
ANGLUIN, D .
INFORMATION AND COMPUTATION, 1987, 75 (02) :87-106
[8]  
ARENS Y, 1996, RLTR96118 TR USC ROM
[9]  
ASHISH N, 1997, P COOP INF SYST
[10]  
BARTLETT PL, 1991, PROCEEDINGS OF THE FOURTH ANNUAL WORKSHOP ON COMPUTATIONAL LEARNING THEORY, P24