A knowledge-based approach for duplicate elimination in data cleaning

Cited by: 13

Authors:

Low, WL [1]

Lee, ML [1]

Ling, TW [1]

Affiliation:

[1] Natl Univ Singapore, Sch Comp, Singapore 117543, Singapore

Source:

INFORMATION SYSTEMS | 2001, Vol. 26, No. 08

Keywords:

data cleaning; duplicate elimination; knowledge-based system

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

Existing duplicate elimination methods for data cleaning compute the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. Conversely, high precision can be achieved at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategy and more. We propose a new method for computing transitive closure under uncertainty to handle the merging of groups of inexact duplicate records, and explain why small changes to window sizes have little effect on the results of the sorted neighborhood method. An experimental study with two real-world datasets shows that this approach can accurately identify duplicates and anomalies with high recall and precision, thus effectively resolving the recall-precision dilemma. © 2001 Published by Elsevier Science Ltd.
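The baseline that the abstract refers to, comparing nearby records in a sorted database inside a sliding window, can be sketched roughly as follows. This is only an illustrative sketch, not the paper's framework: the `similarity` function (built on Python's `difflib`), the `threshold` and `window` parameters, and the sample records are all assumptions introduced here, and groups are merged with a plain union-find transitive closure rather than the transitive closure under uncertainty that the authors propose.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1]; an assumption, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Return sorted groups of record indices judged to be duplicates.

    Records are sorted by `key`; each record is compared only with the
    next `window - 1` records in sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                # Plain transitive closure: merge the two groups outright.
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


records = [
    "John Smith, 12 Main St",
    "Jon Smith, 12 Main St",
    "Mary Jones, 4 Oak Ave",
    "John Smith, 12 Main Street",
    "Peter Tan, 9 Hill Rd",
]
strict = sorted_neighborhood(records, key=str.lower, threshold=0.8)
loose = sorted_neighborhood(records, key=str.lower, threshold=0.0)
```

Lowering `threshold` accepts less similar pairs as duplicates, which tends to raise recall and lower precision; in the extreme case of `threshold=0.0`, every pair inside a window is merged and the plain transitive closure chains unrelated records into a single group. That over-merging is the kind of behavior a closure that accounts for uncertainty is meant to limit.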


Pages: 585-606

Page count: 22

