TAILOR: A record linkage toolbox

Citations: 88
Authors
Elfeky, MG [1 ]
Verykios, VS [1 ]
Elmagarmid, AK [1 ]
Affiliation
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
Source
18TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS | 2002
DOI
10.1109/ICDE.2002.994694
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage Toolbox named TAILOR. Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house developed and public domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. Results show that the proposed machine learning record linkage models outperform the existing ones both in accuracy and in performance.
Pages: 17-28
Page count: 12