Descriptive and Prescriptive Data Cleaning

Cited by: 26
Authors
Chalamalla, Anup [1]
Ilyas, Ihab F. [1]
Ouzzani, Mourad [2]
Papotti, Paolo [2]
Affiliations
[1] Univ Waterloo, Waterloo, ON, Canada
[2] QCRI, Ar Rayyan, Qatar
Source
SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2014
DOI
10.1145/2588555.2610520
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
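The idea in the abstract, a rule checked on the report with the error traced back to the sources, can be illustrated with a small sketch. This is not the authors' system or algorithm: the tables, the rule, and the helper names (`build_report`, `violates`, `explain`) are all hypothetical, and the ranking is a simple count of how many violating report rows each source tuple contributes to.

```python
from collections import Counter

# Hypothetical source tables; every name and value here is made up for illustration.
customers = {
    "c1": {"country": "US", "currency": "USD"},
    "c2": {"country": "US", "currency": "EUR"},  # the planted source error
}
orders = {
    "o1": {"cid": "c1", "amount": 10},
    "o2": {"cid": "c2", "amount": 20},
    "o3": {"cid": "c2", "amount": 30},
    "o4": {"cid": "c2", "amount": 40},
}


def build_report(orders, customers):
    """Join orders with customers, keeping the lineage of each report row."""
    report = []
    for oid, order in orders.items():
        cust = customers[order["cid"]]
        report.append({
            "order": oid,
            "amount": order["amount"],
            "country": cust["country"],
            "currency": cust["currency"],
            # Lineage: the source tuples this report row was derived from.
            "lineage": {("orders", oid), ("customers", order["cid"])},
        })
    return report


def violates(row):
    """Assumed quality rule on the report: US rows must be billed in USD."""
    return row["country"] == "US" and row["currency"] != "USD"


def explain(report):
    """Count how many violating report rows each source tuple contributes to."""
    blame = Counter()
    for row in report:
        if violates(row):
            blame.update(row["lineage"])
    return blame


report = build_report(orders, customers)
violating_orders = [row["order"] for row in report if violates(row)]
blame = explain(report)
top_suspect = blame.most_common(1)[0]
```

Here three report rows violate the rule, but all three share one customer tuple in their lineage, so that tuple ranks first as the likely source of the error. Fixing it at the source would clear all three violations and would survive a regeneration of the report.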
Pages: 445-456
Page count: 12