Duplicate record detection: A survey

Cited by: 925
Authors
Elmagarmid, Ahmed K. [1]
Ipeirotis, Panagiotis G.
Verykios, Vassilios S.
Affiliations
[1] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[2] Purdue Univ, Cyber Ctr, W Lafayette, IN 47907 USA
[3] NYU, Leonard N Stern Sch Business, Dept Informat Operat & Management Sci, New York, NY 10012 USA
[4] Univ Thessaly, Dept Comp & Commun Engn, Volos 38221, Greece
Funding
National Science Foundation (USA)
Keywords
duplicate detection; data cleaning; data integration; record linkage; data deduplication; instance identification; database hardening; name matching; identity uncertainty; entity resolution; fuzzy duplicate detection; entity matching;
DOI
10.1109/TKDE.2007.250581
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
Pages: 1-16
Page count: 16