Domain-independent data cleaning via analysis of entity-relationship graph

Cited by: 106
Authors
Kalashnikov, Dmitri V. [1 ]
Mehrotra, Sharad [1 ]
Affiliations
[1] Univ Calif Irvine, Irvine, CA 92697 USA
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2006 / Vol. 31 / Issue 02
Keywords
design; experimentation; performance; theory; connection strength; data cleaning; entity resolution; graph analysis; reference disambiguation; relationship analysis; RelDC;
DOI
10.1145/1138394.1138401
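Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 [Computer Science and Technology];
Abstract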
In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RELDC) and the traditional techniques is that RELDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real data sets and over synthetic data sets show that analysis of relationships significantly improves the quality of the result.
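To make the idea of relationship-based disambiguation concrete, here is a minimal sketch in Python. It is not the RELDC algorithm from the article: the scoring rule (counting short simple paths between the context of a reference and each candidate entity), the function names `count_short_paths` and `disambiguate`, and the toy graph are all assumptions chosen for illustration. The ambiguous reference "J. Smith" appears in paper P1 and matches two candidates on features alone; the relationship graph is used to break the tie.

```python
def count_short_paths(graph, source, target, max_len):
    """Count simple paths from source to target that use at most max_len edges."""
    count = 0
    stack = [(source, {source}, 0)]  # (current node, visited nodes, edges used)
    while stack:
        node, visited, depth = stack.pop()
        if node == target:
            count += 1
            continue
        if depth == max_len:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                stack.append((nxt, visited | {nxt}, depth + 1))
    return count


def disambiguate(graph, context_entity, candidates, max_len=3):
    """Pick the candidate most strongly connected to the context entity.

    The path count is a stand-in (an assumption) for a connection-strength measure.
    """
    scores = {c: count_short_paths(graph, context_entity, c, max_len)
              for c in candidates}
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


# Hypothetical undirected entity-relationship graph (adjacency lists).
# Paper P1 was written by Alice and an ambiguous "J. Smith".
graph = {
    "P1": ["Alice"],
    "Alice": ["P1", "P0", "UCI"],
    "P0": ["Alice", "Jane Smith"],
    "Jane Smith": ["P0", "UCI"],
    "UCI": ["Alice", "Jane Smith"],
    "John Smith": ["MIT"],
    "MIT": ["John Smith"],
}

best, scores = disambiguate(graph, "P1", ["John Smith", "Jane Smith"])
print(best, scores)
```

In this toy graph, Jane Smith is linked to P1 through a shared co-author and a shared affiliation, while John Smith has no connection at all, so the relationship evidence favors Jane Smith even though both names match the description equally well.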
Pages: 716-767
Number of pages: 52