Syntactic clustering of the Web

Cited by: 372
Authors
Broder, AZ
Glassman, SC
Manasse, MS
Zweig, G
Affiliations
[1] UNIV CALIF BERKELEY,DEPT COMP SCI,BERKELEY,CA 94720
[2] DIGITAL EQUIPMENT CORP,SYST RES CTR,PALO ALTO,CA 94301
Source
COMPUTER NETWORKS AND ISDN SYSTEMS | 1997 / Vol. 29 / Issues 8-13
Keywords
similarity; duplication; resemblance; Web search; fingerprints; signatures
DOI
10.1016/S0169-7552(97)00031-7
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed Web pages, and identifying violations of intellectual property rights. © 1997 Published by Elsevier Science B.V.
Pages: 1157-1166
Page count: 10