Scheduling algorithms for Web crawling

Cited by: 22
Authors
Castillo, C [1]
Marin, M [1]
Rodriguez, A [1]
Baeza-Yates, R [1]
Affiliations
[1] Univ Chile, Ctr Web Res, Santiago, Chile
Source
WEBMEDIA & LA-WEB 2004, VOL 1, PROCEEDINGS | 2004
DOI
10.1109/WEBMED.2004.1348139
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This article presents a comparative study of strategies for Web crawling. We show that a combination of breadth first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives. Our study was performed on a large sample of the Chilean Web which was crawled by using simulators, so that all strategies were compared under the same conditions, and actual crawls to validate our conclusions. We also explored the effects of large scale parallelism in the page retrieval task and multiple-page requests in a single connection for effective amortization of latency times.
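The ordering described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the algorithm's details, so the per-site FIFO queues, the tie-breaking rule, the `batch_size` parameter (standing in for multiple-page requests over one connection), and the in-memory `links` graph are all hypothetical choices made for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_order(seeds, links, batch_size=2):
    """Simulate a crawl: breadth-first within each site, largest pending site first.

    `links` is a hypothetical in-memory link graph (URL -> list of outgoing URLs)
    that stands in for real page downloads.
    """
    pending = {}   # site -> FIFO queue of discovered, not yet fetched URLs
    seen = set()   # every URL ever discovered, to avoid fetching twice
    order = []     # URLs in the order the simulated crawler fetches them

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            site = urlsplit(url).netloc
            pending.setdefault(site, deque()).append(url)

    for url in seeds:
        enqueue(url)

    while pending:
        # Choose the site with the most pending pages (assumed reading of
        # "largest sites first"); ties are broken by site name for determinism.
        site = max(pending, key=lambda s: (len(pending[s]), s))
        queue = pending[site]
        # Take up to batch_size pages from that site, as if over one connection.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        if not queue:
            del pending[site]
        for url in batch:
            order.append(url)
            # Newly discovered links join the back of their site's queue.
            for target in links.get(url, []):
                enqueue(target)
    return order
```

Calling `crawl_order(["http://a.cl/"], links)` with a small hand-built `links` dictionary returns the simulated fetch order, which makes it possible to compare orderings under identical conditions, in the spirit of the simulator-based comparison the abstract mentions.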
Pages: 10-17
Number of pages: 8
Related papers
41 in total
[1]  
AILLERET S, 2004, LARBIN
[2]  
[Anonymous], 2001, Proceedings of the 10th international conference on World Wide Web
[3]  
[Anonymous], 2000, P ACM SIGMOD INT C M, DOI 10.1145/342009.335391
[4]   Relating Web characteristics with link based Web page ranking [J].
Baeza-Yates, R ;
Castillo, C .
EIGHTH SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2001, :21-32
[5]  
BAEZAYATES R, 2002, SOFT COMPUTING SYSTE, P565
[6]  
BRANDMAN O, 2000, P WORKSH PERF ARCH W
[7]  
BREWINGTON B, 2000, P 9 INT WORLD WID WE, P257
[8]   The anatomy of a large-scale hypertextual Web search engine [J].
Brin, S ;
Page, L .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) :107-117
[9]   Graph structure in the Web [J].
Broder, A ;
Kumar, R ;
Maghoul, F ;
Raghavan, P ;
Rajagopalan, S ;
Stata, R ;
Tomkins, A ;
Wiener, J .
COMPUTER NETWORKS-THE INTERNATIONAL JOURNAL OF COMPUTER AND TELECOMMUNICATIONS NETWORKING, 2000, 33 (1-6) :309-320
[10]  
BURKE RD, 2000, P 1 ACM IEEE CS JOIN