A Nutch-Based Targeted Web Site Crawling System

Cited by: 10

Authors:

Xu Jian (徐健) [1,2]

Zhang Zhixiong (张智雄) [1]

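Affiliations:

[1] National Science Library, Chinese Academy of Sciences

[2] Department of Information Management, Sun Yat-sen University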

Keywords:

targeted web site crawling system; Nutch; web site crawling; web page denoising

DOI:

Not available

Chinese Library Classification (CLC) number:

TP393.09

Discipline classification code:

080402

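Abstract:

Based on a comparative analysis of four representative open-source web crawlers (Nutch, Heritrix, WCT, and Web-Harvest), this paper proposes a Nutch-based system for targeted crawling of web sites. It examines in detail the key problems involved: selecting seed sites, managing the crawl process, denoising web pages, and discovering new seed sites.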
