Conditional functional dependencies for capturing data inconsistencies

被引:244
作者
Fan, Wenfei [1 ]
Geerts, Floris [1 ]
Jia, Xibei [1 ]
Kementsietsidis, Anastasios [2 ]
机构
[1] Univ Edinburgh, Sch Informat, Lab Fdn Comp Sci, Edinburgh EH8 9YL, Midlothian, Scotland
[2] IBM TJ Watson Res Ctr, Hawthorne, NY 10532 USA
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2008年 / 33卷 / 02期
基金
英国工程与自然科学研究理事会;
关键词
algorithms; experimentation; performance; theory; data cleaning; functional dependency; SQL;
D O I
10.1145/1366102.1366103
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of transitional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong's axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
引用
收藏
页数:48
相关论文
共 51 条
  • [1] Abiteboul S., 1995, Foundations of databases, V1st
  • [2] Answer sets for consistent query answering in inconsistent databases
    Arenas, M
    Bertossi, L
    Chomicki, J
    [J]. THEORY AND PRACTICE OF LOGIC PROGRAMMING, 2003, 3 : 393 - 424
  • [3] ARMSTRONG WW, 1974, P IFIP, V74, P580
  • [4] Constraint-generating dependencies
    Baudinet, M
    Chomicki, J
    Wolper, P
    [J]. JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 1999, 59 (01) : 94 - 115
  • [5] A PROOF PROCEDURE FOR DATA DEPENDENCIES
    BEERI, C
    VARDI, MY
    [J]. JOURNAL OF THE ACM, 1984, 31 (04) : 718 - 741
  • [6] Beeri C., 1979, ACM Transactions on Database Systems, V4, P30, DOI 10.1145/320064.320066
  • [7] Bertossi L, 2004, LOGICS FOR EMERGING APPLICATIONS OF DATABASES, P43
  • [8] Bohannon Philip., 2005, SIGMOD Conference, P143, DOI [10.1145/1066157.1066175, DOI 10.1145/1066157.1066175]
  • [9] Bohannon Philip, 2007, P 23 INT C DAT ENG I, P746
  • [10] BRA PD, 1983, C AUT LANG PROGR, P67