Tag tree template for Web information and schema extraction

Cited by: 19
Authors
Ji, Xiangwen [1]
Zeng, Jianping [1]
Zhang, Shiyong [1]
Wu, Chengrong [1]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
Keywords
Tag tree template; Web information extraction; Schema extraction; Tree similarity; PAGES
DOI
10.1016/j.eswa.2010.05.027
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Extracting information from the Web is both interesting and challenging, and it can support Web search, information retrieval, and Web mining. Web pages on many sites are produced dynamically as structured records from a back-end database, based on an HTML template. To efficiently extract meaningful information, including records and the data schema, from this kind of page, a new method based on a Tag tree template is proposed. Web pages from different Web sites are parsed into Tag trees, and templates for each site are then generated from the trees using a cost-based tree similarity measure. The exclusive content of each page is then extracted by using the templates to parse the page. Finally, the records in the pages and the schema of those records are extracted from the exclusive content by finding repeating patterns and applying heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information. (C) 2010 Elsevier Ltd. All rights reserved.
Pages: 8492-8498
Number of pages: 7
Related papers
21 in total
[1] ALANI H, 2003, P ISWC WORKSH, P77
[2] Alvarez M, 2007, LECT NOTES COMPUT SC, V4831, P212
[3] Arasu A., 2003, P 2003 ACM SIGMOD IN, P337, DOI 10.1145/872757.872799
[4] Baumgartner R., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P119
[5] A survey on tree edit distance and related problems [J].
Bille, P.
THEORETICAL COMPUTER SCIENCE, 2005, 337 (1-3): 217-239
[6] Buttler D., 2004, 5 INT C INT COMP
[7] Self-pumped and mutually pumped phase conjugation in pentagon-shaped BaTiO3 crystal with plus c-face incident geometry [J].
Chang, CC; Chen, TC; Hu, GW; Yau, HF; Ye, PX.
PHOTOREFRACTIVE EFFECTS, MATERIALS AND DEVICES, PROCEEDINGS, 2001, 62: 681-681
[8] Automatic information extraction from semi-structured Web pages by pattern discovery [J].
Chang, CH; Hsu, CN; Lui, SC.
DECISION SUPPORT SYSTEMS, 2003, 35 (01): 129-147
[9] CHANG CH, 2001, P 5 PAC AS C KNOWL D, P4
[10] Crescenzi V., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P109