Web-crawling reliability

Cited by: 40
Author
Cothey, V [1]
Affiliation
[1] Wolverhampton Univ, Sch Comp & Informat Technol, Wolverhampton WV1 1SB, England
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2004 / Vol. 55 / No. 14
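Keywords
DOI
10.1002/asi.20078
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract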
In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selective. I also report the results of a large-scale experimental simulation of Web crawling that illustrates the effects of different crawling policies on data collection. It is concluded that the reliability of Web crawling as a data collection technique is improved by fuller reporting of relevant crawling policies.
Pages: 1228-1238
Page count: 11
Related papers
36 items in total
[1]  
Amitay E, 2003, P 14 ACM C HYP HYP, P38
[2]  
[Anonymous], 1993, GUIDELINES ROBOT WRI
[3]  
[Anonymous], 2001, Proceedings of the 10th international conference on World Wide Web
[4]   Engineering a multi-purpose test collection for Web retrieval experiments [J].
Bailey, P ;
Craswell, N ;
Hawking, D .
INFORMATION PROCESSING & MANAGEMENT, 2003, 39 (06) :853-871
[5]  
Bar-Ilan J, 2002, LIBR TRENDS, V50, P371
[6]   Data collection methods on the Web for informetric purposes - A review and analysis [J].
Bar-Ilan, J .
SCIENTOMETRICS, 2001, 50 (01) :7-32
[7]  
BERGMARK D, 2002, P 6 EUR C RES ADV TE, P91
[8]   A technique for measuring the relative size and overlap of public Web search engines [J].
Bharat, K ;
Broder, A .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) :379-388
[9]  
Bharat K, 2000, J AM SOC INFORM SCI, V51, P1114, DOI 10.1002/1097-4571(2000)9999:9999<::AID-ASI1025>3.0.CO;2-0
[10]  