Trawling the Web for emerging cyber-communities

Citations: 310
Authors
Kumar, R [1]
Raghavan, P [1]
Rajagopalan, S [1]
Tomkins, A [1]
Affiliations
[1] IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA
Keywords
Web mining; communities; trawling; link analysis
DOI
10.1016/S1389-1286(99)00040-7
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The Web harbors a large number of communities - groups of content-creators sharing a common interest - each of which manifests itself as a set of interlinked Web pages. Newsgroups and commercial Web directories together contain on the order of 20,000 such communities; our particular interest here is in emerging communities - those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a Web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms and the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment. © 1999 Published by Elsevier Science B.V. All rights reserved.
Pages: 1481-1493
Number of pages: 13
Related papers
23 entries in total
[1] Abiteboul S., Cluet S., Christophides V., Milo T., Moerkotte G., Siméon J. Querying documents in object databases [J]. International Journal on Digital Libraries, 1997, 1 (1): 5-19
[2] AGRAWAL R, 1994, P VLDB SANT CHIL SEP
[3] [Anonymous], P ACM SIGCHI C HUM F
[4] [Anonymous], P 9 ACM C HYP HYP
[5] [Anonymous], 1998, Proceedings of the 7th international conference on World Wide Web (WWW), DOI 10.1016/S0169-7552(98)00110-X
[6] [Anonymous], 1998, P ACM SIAM S DISCR A
[7] BHARAT K, 1998, P 21 SIGIR C MELB AU
[8] BHARAT K, 1998, P 7 INT WORLD WID WE, P379
[9] Broder A. Z., 1997, P 6 INT WORLD WID WE, V29, P1157, DOI 10.1016/S0169-7552(97)00031-7
[10] CARRIERE J, 1997, P 6 INT WORLD WID WE