A general evaluation framework for topical crawlers

Citations: 88
Authors
Srinivasan, P [1]
Menczer, F
Pant, G
Affiliations
[1] Univ Iowa, Sch Lib & Informat Sci, Iowa City, IA 52242 USA
[2] Univ Iowa, Dept Management Sci, Iowa City, IA 52242 USA
[3] Indiana Univ, Sch Informat, Bloomington, IN 47408 USA
[4] Indiana Univ, Dept Comp Sci, Bloomington, IN 47408 USA
[5] Univ Utah, Sch Accounting & Informat Syst, Salt Lake City, UT 84112 USA
Source
INFORMATION RETRIEVAL | 2005 / Vol. 8 / Issue 3
Funding
National Science Foundation (USA)
Keywords
web crawlers; evaluation; tasks; topics; precision; recall; efficiency
DOI
10.1007/s10791-005-6993-5
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions, including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally, we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating, and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
Pages: 417-447
Page count: 31