A systematic study on parameter correlations in large-scale duplicate document detection

被引:12
作者
Ye, Shaozhi [1 ]
Wen, Ji-Rong [2 ]
Ma, Wei-Ying [2 ]
机构
[1] Univ Calif Davis, Dept Comp Sci, Davis, CA 95616 USA
[2] Microsoft Res Asia, Beijing, Peoples R China
关键词
duplicate document detection; clustering; sampling; shingling;
D O I
10.1007/s10115-007-0071-9
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Although much work has been done on duplicate document detection (DDD) and its applications, we observe the absence of a systematic study on the performance and scalability of large-scale DDD algorithms. It is still unclear how various parameters in DDD correlate mutually, such as similarity threshold, precision/recall requirement, sampling ratio, and document size. This paper explores the correlations among several most important parameters in DDD and the impact of sampling ratio is of most interest since it heavily affects the accuracy and scalability of DDD algorithms. An empirical analysis is conducted on a million HTML documents from the TREC .GOV collection. Experimental results show that even when using the same sampling ratio, the precision of DDD varies greatly on documents with different sizes. Based on this observation, we propose an adaptive sampling strategy for DDD, which minimizes the sampling ratio with the constraint of a given precision requirement. We believe that the insights from our analysis are helpful for guiding the future large-scale DDD work.
引用
收藏
页码:217 / 232
页数:16
相关论文
共 27 条
[1]  
[Anonymous], 1981, TR1581 CTR RES COMP
[2]  
BELLARE M, 2004, EUROCRYPT 2004, P401
[3]  
Bharat K, 2000, J AM SOC INFORM SCI, V51, P1114, DOI 10.1002/1097-4571(2000)9999:9999<::AID-ASI1025>3.0.CO
[4]  
2-0
[5]  
Bharat K, 1999, PROCEEDINGS OF THE EIGHTH INTERNATIONAL WORLD WIDE WEB CONFERENCE, P501
[6]  
BRIN S, 1995, P 1995 ACM SIGMOD IN, P398
[7]  
Broder A. Z., 1997, P 6 INT WORLD WID WE, V29, P1157, DOI [DOI 10.1016/S0169-7552(97)00031-7, 10.1016/S0169-7552(97)00031-7]
[8]  
CHO J, 2000, P 2000 ACM INT C MAN, P355
[9]   Collection statistics for fast duplicate document detection [J].
Chowdhury, A ;
Frieder, O ;
Grossman, D ;
McCabe, MC .
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2002, 20 (02) :171-191
[10]  
Conrad JG, 2003, P 12 INT C INF KNOWL, P443, DOI DOI 10.1145/956863.956946