A technique for measuring the relative size and overlap of public Web search engines

Cited by: 88
Authors
Bharat, K [1]
Broder, A [1]
Affiliation
[1] Digital Equipment Corp, Syst Res Ctr, Palo Alto, CA 94301 USA
Source
Computer Networks and ISDN Systems | 1998, Vol. 30, Issues 1-7
Keywords
search engines; coverage; Web page sampling
DOI
10.1016/S0169-7552(98)00127-5
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries. Our technique does not require privileged access to any database. It can be implemented by third-party evaluators using only public query interfaces. We present results from our experiments showing size and overlap estimates for HotBot, AltaVista, Excite, and Infoseek as percentages of their total joint coverage in mid-1997 and in November 1997. Our method does not provide absolute values. However, using data from other sources, we estimate that as of November 1997 the numbers of pages indexed by HotBot, AltaVista, Excite, and Infoseek were roughly 77M, 100M, 32M, and 17M, respectively, and that the joint total coverage was 160 million pages. We further conjecture that the size of the static, public Web as of November 1997 was over 200 million pages. The most startling finding is that the overlap is very small: less than 1.4% of the total coverage, or about 2.2 million pages, was indexed by all four engines. © 1998 Published by Elsevier Science B.V. All rights reserved.
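The abstract says the method yields relative sizes and overlaps rather than absolute values, but it does not spell out the estimators. The sketch below shows one plausible way such ratios could be derived from sampled pages; it is an assumption for illustration, not a statement of the authors' procedure. The function names and the sample counts are hypothetical, and the counts are made up rather than taken from the paper.

```python
def overlap_fraction(hits: int, sample_size: int) -> float:
    """Estimate |A ∩ B| / |A| from pages sampled from engine A.

    `hits` is how many of the `sample_size` pages sampled from A
    were also found in engine B (hypothetical inputs).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= hits <= sample_size:
        raise ValueError("hits must lie between 0 and sample_size")
    return hits / sample_size


def size_ratio(a_in_b_hits: int, a_samples: int,
               b_in_a_hits: int, b_samples: int) -> float:
    """Estimate |A| / |B| without knowing either absolute size.

    Since |A ∩ B| = p_ab * |A| = p_ba * |B|, it follows that
    |A| / |B| = p_ba / p_ab, where p_ab is the fraction of A's
    sample found in B and p_ba is the fraction of B's sample found in A.
    """
    p_ab = overlap_fraction(a_in_b_hits, a_samples)
    p_ba = overlap_fraction(b_in_a_hits, b_samples)
    if p_ab == 0:
        raise ValueError("no overlap observed; ratio is undefined")
    return p_ba / p_ab


if __name__ == "__main__":
    # Made-up counts: 50 of 200 pages sampled from A are in B,
    # and 100 of 200 pages sampled from B are in A.
    print(overlap_fraction(50, 200))
    print(size_ratio(50, 200, 100, 200))
```

Because only ratios come out of such a calculation, an absolute figure for at least one engine has to come from elsewhere, which matches the abstract's remark that the page counts rely on data from other sources.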
Pages: 379-388
Page count: 10