Real-world data is dirty: Data cleansing and the merge/purge problem

Cited by: 368
Authors
Hernandez, MA [1]
Stolfo, SJ [1]
Affiliations
[1] Columbia Univ, Dept Comp Sci, New York, NY 10027 USA
Funding
National Science Foundation (USA)
Keywords
data cleaning; data cleansing; duplicate elimination; semantic integration
DOI
10.1023/A:1009761603038
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
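The multi-pass idea in the abstract (sort on a different key in each pass, then take the transitive closure of the matches found) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the paper's system: the fixed-size comparison window over the sorted order, the toy `is_match` rule standing in for an "equational theory", the field names, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]


def find(parent: List[int], i: int) -> int:
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def windowed_pass(records: List[Record], key: Callable[[Record], tuple],
                  is_match: Callable[[Record, Record], bool],
                  window: int) -> Set[Tuple[int, int]]:
    """One pass: sort by `key`, compare each record only with the
    previous `window - 1` records in sorted order (an assumed strategy)."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def merge_purge(records: List[Record], keys, is_match, window: int = 3):
    """Run one pass per key and merge all matched pairs transitively."""
    parent = list(range(len(records)))
    for key in keys:
        for i, j in windowed_pass(records, key, is_match, window):
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)
    clusters: Dict[int, List[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


def is_match(a: Record, b: Record) -> bool:
    """Toy rule (hypothetical): same last name and first initial,
    or same address and identical first-name field."""
    same_person = (a["last"].lower() == b["last"].lower()
                   and a["first"][0] == b["first"][0])
    same_place = a["addr"] == b["addr"] and a["first"] == b["first"]
    return same_person or same_place


# Hypothetical sample data; record 2 has a misspelled last name.
records = [
    {"first": "Ann", "last": "Lee", "addr": "12 Oak St"},
    {"first": "A.", "last": "Lee", "addr": "12 Oak Street"},
    {"first": "A.", "last": "Leigh", "addr": "12 Oak Street"},
    {"first": "Bob", "last": "Kim", "addr": "9 Elm Rd"},
]


def by_last(r: Record) -> tuple:
    return (r["last"], r["first"])


def by_addr(r: Record) -> tuple:
    return (r["addr"], r["first"])


print(merge_purge(records, [by_last, by_addr], is_match, window=2))
# [[0, 1, 2], [3]]
```

In this toy data, records 0 and 2 never match each other directly; they end up in the same cluster only because both match record 1. A single pass on the last-name key with the same window leaves record 2 on its own, which is the kind of miss that additional passes on other keys are meant to recover.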
Pages: 9-37
Page count: 29