Improving the performance of focused web crawlers

Cited by: 83
Authors
Batsakis, Sotiris [1]
Petrakis, Euripides G. M. [1]
Milios, Evangelos [2]
Affiliations
[1] TUC, Dept Elect & Comp Engn, GR-73100 Khania, Crete, Greece
[2] Dalhousie Univ, Fac Comp Sci, Halifax, NS B3H 1W5, Canada
Keywords
Focused crawler; Learning crawler; Hidden Markov Model (HMM) crawler; World Wide Web; ALGORITHM;
DOI
10.1016/j.datak.2009.04.002
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler) providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improved the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics. (C) 2009 Elsevier B.V. All rights reserved.
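The abstract states that the crawlers share one baseline and differ only in the priority assignment function, and that combining page content with anchor text works best. A minimal sketch of that idea follows; it is not the authors' implementation. The names `cosine`, `link_priority`, `Frontier`, the weight `alpha`, and the use of raw term-frequency cosine similarity are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two texts using raw term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def link_priority(topic: str, page_text: str, anchor_text: str,
                  alpha: float = 0.5) -> float:
    """Hypothetical priority: weighted mix of the similarity of the linking
    page's content and of the link's anchor text to the topic description."""
    return alpha * cosine(topic, page_text) + (1 - alpha) * cosine(topic, anchor_text)


class Frontier:
    """Download queue ordered by priority, highest first.

    heapq is a min-heap, so scores are stored negated.
    """

    def __init__(self) -> None:
        self._heap = []
        self._seen = set()

    def push(self, url: str, score: float) -> None:
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)
```

In this sketch, swapping `link_priority` for another scoring function is the only change needed to obtain a different crawler variant, which mirrors the shared-baseline setup the abstract describes. The example URLs used to exercise it would be placeholders, not data from the paper.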
Pages: 1001-1013
Number of pages: 13