Web data extraction based on structural similarity

Cited by: 29
Authors
Li, Z [1]
Ng, WK [1]
Sun, AX [1]
Affiliation
[1] Nanyang Technol Univ, Ctr Adv Informat Syst, Sch Comp Engn, Singapore, Singapore
Keywords
classification; clustering; framework; web data extraction
DOI
10.1007/s10115-004-0188-z
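Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligent science and technology]
Abstract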
Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
Pages: 438-461
Number of pages: 24