Collection statistics for fast duplicate document detection

Cited by: 136
Authors
Chowdhury, A [1]
Frieder, O [1]
Grossman, D [1]
McCabe, MC [1]
Affiliations
[1] IIT, Information Retrieval Lab, Chicago, IL 60616, USA
DOI
10.1145/506309.506311
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections: a 30 MB collection of 18,577 web documents developed by Excite@Home, and three NIST collections. The first NIST collection consists of 100 MB of LA Times documents (18,232 documents), which is roughly similar in number of documents to the Excite@Home collection. The other two collections are each 2 GB: a 247,491-document web collection and the 528,023-document collection from TREC disks 4 and 5. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. Compared with the state of the art, our approach improved detection accuracy and executed in roughly one-fifth the time.
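The abstract does not spell out how collection statistics are used, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that terms are filtered by inverse document frequency (idf), that the surviving unique terms are sorted and hashed with SHA-1, and that documents sharing a hash are reported as duplicates. The function names, the tokenizer, and the `min_idf` threshold are all hypothetical choices made for illustration.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def inverse_document_frequencies(docs):
    # docs maps a document id to its text; idf = ln(N / document frequency).
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(text, idf, min_idf):
    # Keep only terms whose idf meets the (hypothetical) threshold, then
    # hash the sorted set so word order and repetition do not matter.
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # nothing distinctive left to hash
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def find_duplicates(docs, min_idf):
    # Group document ids by fingerprint; groups of two or more are duplicates.
    idf = inverse_document_frequencies(docs)
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        fp = fingerprint(text, idf, min_idf)
        if fp is not None:
            groups[fp].append(doc_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Because each document is reduced to a single hash, grouping takes one pass over the collection once the idf table is built, which is one way a statistics-based approach could avoid pairwise document comparisons.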
Pages: 171-191
Page count: 21