Interaction between Record Matching and Data Repairing

被引:44
作者
Fan, Wenfei [1 ,2 ]
Ma, Shuai [2 ]
Tang, Nan [3 ]
Yu, Wenyuan [1 ]
机构
[1] Univ Edinburgh, Edinburgh, Midlothian, Scotland
[2] Beihang Univ, SKLSDE Lab, Beijing, Peoples R China
[3] QCRI, Al Rayyan, Qatar
来源
ACM JOURNAL OF DATA AND INFORMATION QUALITY | 2014年 / 4卷 / 04期
基金
英国工程与自然科学研究理事会;
关键词
Design; Algorithms; Performance; Data repairing; record matching; conditional functional dependency; matching dependency;
D O I
10.1145/2567657
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using integrity constraints. These are typically treated as separate processes in current data cleaning systems, based on heuristic solutions. This article studies a new problem in connection with data cleaning, namely the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we provide a uniform framework that seamlessly unifies repairing and matching operations to clean a database based on integrity constraints, matching rules, and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination, and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete, to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analyses, respectively, which are more accurate than fixes generated by heuristics. Heuristic fixes are produced only when deterministic or reliable fixes are unavailable. We experimentally verify that our techniques can significantly improve the accuracy of record matching and data repairing that are taken as separate processes, using real-life and synthetic data.
引用
收藏
页数:38
相关论文
共 61 条
  • [1] Abiteboul Serge, 1995, FDN DATABASES
  • [2] AIKEN A, 1993, P CSL 1993, P1
  • [3] [Anonymous], 2003, 907522003E ISOIEC
  • [4] [Anonymous], P SIGMOD
  • [5] Large-Scale Deduplication with Constraints using Dedupalog
    Arasu, Arvind
    Re, Christopher
    Suciu, Dan
    [J]. ICDE: 2009 IEEE 25TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2009, : 952 - 963
  • [6] Answer sets for consistent query answering in inconsistent databases
    Arenas, M
    Bertossi, L
    Chomicki, J
    [J]. THEORY AND PRACTICE OF LOGIC PROGRAMMING, 2003, 3 : 393 - 424
  • [7] Arenas M., 2003, THEOR PRACT LOG PROG, V3, P4
  • [8] Bayardo R.J., 2007, PROC INT C WORLD WID, P131, DOI DOI 10.1145/1242572.1242591
  • [9] Bertossi L., 2011, P ICDT, P268, DOI DOI 10.1145/1938551.1938585
  • [10] Beskales G., 2009, P VLDB ENDOW, V2, P1