Creating probabilistic databases from duplicated data

被引:37
作者
Hassanzadeh, Oktie [1 ]
Miller, Renee J. [1 ]
机构
[1] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
基金
加拿大自然科学与工程研究理事会;
关键词
Probabilistic databases; Duplicate detection; String databases;
D O I
10.1007/s00778-009-0161-2
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task and imperfect data cleaning may result in loss of valuable information. A reasonable alternative approach is to keep duplicates when the correct cleaning strategy is not certain, and utilize an efficient probabilistic query-answering technique to return query results along with probabilities of each answer being correct. In this paper, we present a flexible modular framework for scalably creating a probabilistic database out of a dirty relation of duplicated data and overview the challenges raised in utilizing this framework for large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.
引用
收藏
页码:1141 / 1166
页数:26
相关论文
共 51 条
[1]  
Ailon Nir, 2005, P 37 ANN ACM S THEOR, P684, DOI DOI 10.1145/1060590.1060692
[2]  
[Anonymous], 2003, Internet Mathematics, DOI [10.1080/15427951.2004.10129093, DOI 10.1080/15427951.2004.10129093]
[3]  
[Anonymous], P SIGMOD
[4]  
[Anonymous], 2004, SIGMOD
[5]  
[Anonymous], P 32 INT C VER LARG
[6]  
[Anonymous], 2003, Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, DOI [DOI 10.1145/872757.872796, 10.1145/872757.872796]
[7]  
[Anonymous], 2006, P ICDE C
[8]  
[Anonymous], 2000, THESIS U ULTRECHT
[9]  
[Anonymous], 2007, Data Mining Workshops
[10]   Fast and simple relational processing of uncertain data [J].
Antova, Lyublena ;
Jansen, Thomas ;
Koch, Christoph ;
Olteanu, Dan .
2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, :983-992