Automating the approximate record-matching process

Cited by: 49
Authors
Verykios, VS
Elmagarmid, AK [1]
Houstis, EN
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA 19104 USA
DOI
10.1016/S0020-0255(00)00013-X
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Data quality has many dimensions, one of which is accuracy. Accuracy is usually compromised by errors accidentally or intentionally introduced into a database system. These errors result in inconsistent, incomplete, or erroneous data elements. For example, a small variation in the representation of a data object produces a unique instantiation of the object being represented. To improve the accuracy of the data stored in a database system, we need to compare them either with their real-world counterparts or with other data stored in the same or a different system. In this paper, we address the problem of matching records that refer to the same entity by computing their similarity. Exact record matching has limited applicability in this context, since even simple errors like character transpositions cannot be captured in the record-linking process. Our methodology deploys advanced data-mining techniques for dealing with the high computational and inferential complexity of approximate record matching. (C) 2000 Elsevier Science Inc. All rights reserved.
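The abstract's point about character transpositions can be illustrated with a minimal sketch. It is not the paper's methodology, which relies on data-mining techniques not described in this record. It assumes a simple per-field similarity based on the optimal string alignment distance, which counts an adjacent transposition as a single edit. The function names, the sample records, and the 0.85 threshold are hypothetical choices made for illustration only.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "mi" <-> "im".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def field_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after
    trimming whitespace and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


def records_match(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Hypothetical decision rule: average the similarity over the
    fields both records share and compare it to an assumed threshold."""
    keys = r1.keys() & r2.keys()
    if not keys:
        return False
    avg = sum(field_similarity(r1[k], r2[k]) for k in keys) / len(keys)
    return avg >= threshold


# Made-up records that differ by one transposition in the name field.
rec_a = {"name": "Smith", "city": "Lafayette"}
rec_b = {"name": "Simth", "city": "Lafayette"}
exact = rec_a == rec_b                 # exact matching misses the pair
approx = records_match(rec_a, rec_b)   # approximate matching accepts it
```

With these sample records, exact comparison rejects the pair while the similarity-based rule accepts it. A fixed threshold is the simplest possible decision rule and is used here only to keep the sketch short.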
Pages: 83-98
Number of pages: 16