Combining text and link analysis for focused crawling - An application for vertical search engines

Cited by: 50
Authors
Almpanidis, G. [1 ]
Kotropoulos, C. [1 ]
Pitas, I. [1 ]
Affiliations
[1] Aristotle Univ Thessaloniki, Dept Informat, GR-54124 Thessaloniki, Greece
Keywords
focused crawling; information retrieval; latent semantic indexing; text categorisation; vertical search engines; WEB; ALGORITHM; MODEL
DOI
10.1016/j.is.2006.09.004
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The number of vertical search engines and portals has rapidly increased over the last years, making the importance of a topic-driven (focused) crawler self-evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. Our implementation presents a different approach to focused crawling and aims to overcome the limitations imposed by the need to provide initial data for training, while maintaining a high recall/precision ratio. We compare its efficiency with other well-known web information retrieval techniques. (c) 2006 Elsevier B.V. All rights reserved.
Pages: 886-908
Page count: 23