An investigation of web crawler behavior: characterization and metrics

Cited by: 43
Authors
Dikaiakos, MD
Stassopoulou, A
Papageorgiou, L
Affiliations
[1] Univ Cyprus, Dept Comp Sci, CY-1678 Nicosia, Cyprus
[2] Intercoll, Dept Comp Sci, CY-1678 Nicosia, Cyprus
Keywords
web characterization; crawlers;
DOI
10.1016/j.comcom.2005.01.003
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we present a characterization study of search-engine crawlers. For the purposes of our work, we use Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of different crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior to the characteristics of general World-Wide Web traffic and to general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior, vis-à-vis a crawler's preference for resources of a particular format, the frequency of its visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as a basis for our ongoing work on the automatic detection of Web crawlers. (c) 2005 Elsevier B.V. All rights reserved.
Pages: 880-897
Page count: 18