Syntactic clustering of the Web

Cited by: 372

Authors:

Broder, AZ

Glassman, SC

Manasse, MS

Zweig, G

Affiliations:

[1] UNIV CALIF BERKELEY,DEPT COMP SCI,BERKELEY,CA 94720

[2] DIGITAL EQUIPMENT CORP,SYST RES CTR,PALO ALTO,CA 94301

Source:

COMPUTER NETWORKS AND ISDN SYSTEMS | 1997 / Vol. 29 / Issues 8-13

Keywords:

similarity; duplication; resemblance; Web search; fingerprints; signatures

DOI:

10.1016/S0169-7552(97)00031-7

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights. © 1997 Published by Elsevier Science B.V.

Pages: 1157-1166

Page count: 10