Research and Design of a Block-Based Web Page Information Parser

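Cited by: 51
Authors
Yu Manquan (于满泉)
Chen Tierui (陈铁睿)
Xu Hongbo (许洪波)
Affiliation
[1] Institute of Computing Technology, Chinese Academy of Sciences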
Keywords
Web mining; HTML tags; visual features; web page segmentation
DOI
Not available
Chinese Library Classification number
TP393.09
Discipline classification code
080402
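Abstract
This paper describes in detail the basic techniques for parsing web page information. After weighing their strengths and weaknesses, it proposes a block-segmentation algorithm that is reasonably effective on the complexly structured pages of news websites. Based on the requirements of an actual project, the web page information parser TVPS was designed and implemented. Experimental results show that the parser performs well and meets practical needs.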
Pages: 974-976
Page count: 3