Efficient Similarity Joins for Near-Duplicate Detection

被引:215
作者
Xiao, Chuan [1 ]
Wang, Wei [1 ]
Lin, Xuemin [1 ,4 ]
Yu, Jeffrey Xu [2 ]
Wang, Guoren [3 ]
机构
[1] Univ New S Wales, Sch Comp Sci & Engn, Sydney, NSW 2052, Australia
[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[3] Northeastern Univ, Fac Informat Sci & Technol, Shenyang, Peoples R China
[4] E China Normal Univ, Shanghai, Peoples R China
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2011年 / 36卷 / 03期
基金
澳大利亚研究理事会;
关键词
Algorithms; Performance; Similarity join; near duplicate detection; ALGORITHMS; INFORMATION;
D O I
10.1145/2000824.2000825
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find a pair of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
引用
收藏
页数:41
相关论文
共 54 条
  • [1] Abdel-Hamid O., 2009, Proceedings of the 18th international conference on World wide web, P61
  • [2] AGRAWAL S, 2003, P C INN DAT SYST RES
  • [3] [Anonymous], 2003, 2003 ACM SIGMOD INT, DOI DOI 10.1145/872757.872796
  • [4] [Anonymous], P ANN ACM S THEOR CO
  • [5] [Anonymous], P ACM SIGMOD INT C M
  • [6] [Anonymous], 1997, ACM SIGACT NEWS
  • [7] [Anonymous], 1999, P INT C VER LARG DAT
  • [8] ARASU A, 2006, P INT C VER LARG DAT
  • [9] Transformation-based framework for record matching
    Arasu, Arvind
    Chaudhuri, Surajit
    Kaushik, Raghav
    [J]. 2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 40 - 49
  • [10] Baeza-Yates R, 1999, MODERN INFORM RETRIE, V463