Creating probabilistic databases from duplicated data

被引:37
作者
Hassanzadeh, Oktie [1 ]
Miller, Renee J. [1 ]
机构
[1] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
基金
加拿大自然科学与工程研究理事会;
关键词
Probabilistic databases; Duplicate detection; String databases;
D O I
10.1007/s00778-009-0161-2
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
A major source of uncertainty in databases is the presence of duplicate items, i.e., records that refer to the same real-world entity. However, accurate deduplication is a difficult task and imperfect data cleaning may result in loss of valuable information. A reasonable alternative approach is to keep duplicates when the correct cleaning strategy is not certain, and utilize an efficient probabilistic query-answering technique to return query results along with probabilities of each answer being correct. In this paper, we present a flexible modular framework for scalably creating a probabilistic database out of a dirty relation of duplicated data and overview the challenges raised in utilizing this framework for large relations of string data. We study the problem of associating probabilities with duplicates that are detected using state-of-the-art scalable approximate join methods. We argue that standard thresholding techniques are not sufficiently robust for this task, and propose new clustering algorithms suitable for inferring duplicates and their associated probabilities. We show that the inferred probabilities accurately reflect the error in duplicate records.
引用
收藏
页码:1141 / 1166
页数:26
相关论文
共 51 条
[31]  
Gravano L., 2001, Proceedings of the 27th International Conference on Very Large Data Bases, P491
[32]  
Gupta Rahul., 2006, P 32 INT C VERY LARG, P965
[33]  
Gusfield Dan., 1997, Computer science and computational biology
[34]  
HASSANZADEH O, 2007, THESIS U TORONTO
[35]  
HASSANZADEH O, 2009, P INT C VER LARG DAT
[36]  
Hassanzadeh O., 2007, Proc. of the International Workshop on Quality in Databases (QDB), P11
[37]  
HAVELIWALA TH, 2000, P 3 INT WORKSH WEB D, P129
[38]   Real-world data is dirty: Data cleansing and the merge/purge problem [J].
Hernandez, MA ;
Stolfo, SJ .
DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (01) :9-37
[39]  
INDYK P, 1997, ACM S THEOR COMP STO, P618
[40]   TECHNIQUES FOR AUTOMATICALLY CORRECTING WORDS IN TEXT [J].
KUKICH, K .
COMPUTING SURVEYS, 1992, 24 (04) :377-439