Mercator: A scalable, extensible Web crawler

Cited by: 193
Authors
Heydon A. [1]
Najork M. [1]
Affiliation
[1] Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, CA 94301
Keywords
Computing Profession; Performance Number
DOI
10.1023/A:1019213109274
Abstract
This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well‐documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator's support for extensibility and customizability. Finally, we comment on Mercator's performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published. © 1999, Kluwer Academic Publishers.
Pages: 219-229
Page count: 10