DOM-Based Web Information Extraction

Cited by: 10

Authors:

Cui Jixin (崔继馨)

Zhang Peng (张鹏)

Yang Wenzhu (杨文柱)

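Affiliations:

[1] Hebei Institute of Engineering

[2] School of Mathematics and Computer Science, Hebei University, Handan, Hebei

[3] Handan, Hebei

[4] Baoding, Hebei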

Source:

Journal of Agricultural University of Hebei | 2005, Issue 03

Keywords:

DOM; wrapper; extraction rule; information extraction

DOI:

Not available

Chinese Library Classification (CLC) number:

TP393 [Computer Networks]

Discipline classification codes:

081201; 1201

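Abstract:

Web information is vast, dynamic, and irregular, which makes Web information querying and Web information integration very difficult. To address this, the paper studies information extraction from Web documents in HTML format and proposes a DOM-based Web information extraction method. The method generates extraction rules based on DOM paths by attaching semantics and learning from samples, and performs the extraction by traversing the DOM tree. It can be used for Web querying and for constructing wrappers in information integration systems.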
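The idea of applying DOM-path extraction rules by traversing the DOM tree can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the rule syntax, so the path notation (`tag` or `tag[index]` steps separated by `/`), the function names, and the sample page are all hypothetical. The rules here are written by hand, whereas the paper generates them through semantic annotation and sample learning. The sketch also assumes the page has already been cleaned into well-formed XHTML, so Python's standard `xml.dom.minidom` parser can load it.

```python
import re
from xml.dom import minidom

# Hypothetical rule syntax: "tag" matches every child with that tag,
# "tag[i]" matches only the i-th (0-based) child with that tag.
STEP = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def parse_path(path):
    """Turn 'body/table/tr/td[0]' into [(tag, index or None), ...]."""
    steps = []
    for part in path.split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError("bad path step: %r" % part)
        tag, index = match.group(1), match.group(2)
        steps.append((tag, int(index) if index is not None else None))
    return steps


def child_elements(node, tag):
    """Element children of node whose tag name equals tag."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE and child.tagName == tag]


def text_of(node):
    """Concatenate all text below node, stripped of outer whitespace."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts).strip()


def select(root, path):
    """Walk the DOM tree from root, one path step per tree level."""
    nodes = [root]
    for tag, index in parse_path(path):
        next_nodes = []
        for node in nodes:
            matches = child_elements(node, tag)
            if index is None:
                next_nodes.extend(matches)
            elif index < len(matches):
                next_nodes.append(matches[index])
        nodes = next_nodes
    return nodes


def extract(document_text, rules):
    """Apply {semantic label: DOM path} rules to a well-formed document."""
    root = minidom.parseString(document_text).documentElement
    return {label: [text_of(node) for node in select(root, path)]
            for label, path in rules.items()}


# Hypothetical sample page and hand-written rules (paths start below <html>).
PAGE = """<html><body><table>
<tr><td>DOM Basics</td><td>12.50</td></tr>
<tr><td>Web Wrappers</td><td>30.00</td></tr>
</table></body></html>"""

RULES = {"title": "body/table/tr/td[0]", "price": "body/table/tr/td[1]"}

result = extract(PAGE, RULES)
print(result)
```

Each rule pairs a semantic label with a path, so the output is a small record set keyed by label, which is the kind of structured result a wrapper in an integration system would hand to its mediator.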

References

Pages: 90-93

Page count: 4


[1] Clean Up Your Web Pages with HTML TIDY. Raggett D. http://www.w3.org/People/Raggett/tidy/ . 1999

[2] RoadRunner: towards automatic data extraction from large Web sites. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. Proceedings of 27th International Conference on Very Large Database. 2001

[3] Mediators in the Architecture of Future Information Systems. Wiederhold G. IEEE Computer. 1992

[4] Template based Wrapper in the TSIMMIS System. Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, et al. In Proceedings of the 26th SIGMOD International Conference on Management of Data. 1997
