Large-Scale Deduplication with Constraints using Dedupalog

被引:91
作者
Arasu, Arvind [1 ]
Re, Christopher [1 ]
Suciu, Dan [1 ]
机构
[1] Microsoft Res, Redmond, WA USA
来源
ICDE: 2009 IEEE 25TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3 | 2009年
关键词
D O I
10.1109/ICDE.2009.43
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue"; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignore constraints or use them in an ad-hoc domain-specific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms scale to very large datasets. We provide thorough experimental results over real-world data demonstrating the utility of our framework for high-quality and scalable deduplication.
引用
收藏
页码:952 / 963
页数:12
相关论文
共 53 条
  • [21] Clustering with qualitative information
    Charikar, M
    Guruswami, V
    Wirth, A
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2005, 71 (03) : 360 - 383
  • [22] Chaudhuri S, 2005, PROC INT CONF DATA, P865
  • [23] CHAUDHURI S., 2007, SIGMOD, P437
  • [24] CHRISTEN P, 2008, P AUSTR WORKSH HLTH, P17
  • [25] COHEN WW, 1998, SIGMOD C, P201
  • [26] Cong G., 2007, VLDB
  • [27] CULOTTA A, 2005, CIKM, P257
  • [28] Correlation clustering in general weighted graphs
    Demaine, Erik D.
    Emanuel, Dotan
    Fiat, Amos
    Immorlica, Nicole
    [J]. THEORETICAL COMPUTER SCIENCE, 2006, 361 (2-3) : 172 - 187
  • [29] Dong X., 2005, P 2005 ACM SIGMOD IN, P85, DOI DOI 10.1145/1066157.1066168
  • [30] Duda R. O., 2000, Pattern classification