DEByE - Data extraction by example

Cited by: 81
Authors
Laender, AHF [1]
Ribeiro-Neto, B [1]
da Silva, AS [1]
Affiliations
[1] Univ Fed Minas Gerais, ICEx, Dept Comp Sci, BR-31270901 Belo Horizonte, MG, Brazil
Funding
National Science Foundation (United States);
Keywords
data extraction; wrapper generation; web data management;
DOI
10.1016/S0169-023X(01)00047-7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we present DEByE (Data Extraction BY Example), an approach to extracting data from Web sources, based on a small set of examples specified by the user. The novelty is in the fact that the user specifies examples according to a structure of his liking and that this structure is described at example specification time. For the specification of the examples, the user interacts with a tool we developed which adopts nested tables as its visual paradigm. Nested tables are simple, intuitive, and allow shielding the user from technical details (such as HTML tags, formatting operators, and learning automata) related to the extraction problem. The examples provided by the user are then used to generate patterns which allow extracting data from new documents. For the extraction, DEByE adopts a new bottom-up procedure we proposed which is very effective with various Web sources, as demonstrated by our experiments. © 2002 Elsevier Science B.V. All rights reserved.
Pages: 121-154
Number of pages: 34