Comparative study of name disambiguation problem using a scalable blocking-based framework

Cited by: 50
Authors
On, BW [1]
Lee, D [1]
Kang, J [1]
Mitra, P [1]
Affiliations
[1] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
Source
PROCEEDINGS OF THE 5TH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, PROCEEDINGS | 2005
Keywords
name disambiguation; blocking; measuring distances;
DOI
10.1145/1065385.1065463
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we consider the problem of ambiguous author names in bibliographic citations, and comparatively study alternative approaches to identify and correct such name variants (e.g., "Vannevar Bush" and "V. Bush"). Our study is based on a scalable two-step framework, where step 1 is to substantially reduce the number of candidates via blocking, and step 2 is to measure the distance of two names via coauthor information. Combining four blocking methods and seven distance measures on four data sets, we present extensive experimental results, and identify combinations that are scalable and effective for disambiguating author names in citations.
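The two-step framework described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the blocking key (first initial plus last name), the Jaccard distance over coauthor sets, the 0.5 threshold, the function names, and the sample records are all hypothetical. The abstract does not say which four blocking methods or seven distance measures were compared.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Assumed blocking heuristic: first initial of the given name plus last name."""
    parts = name.replace(".", " ").split()
    return (parts[0][0] + " " + parts[-1]).lower()


def jaccard_distance(a, b):
    """Distance between two coauthor sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 1.0  # no coauthor evidence, treat as maximally distant
    return 1.0 - len(a & b) / len(union)


def candidate_pairs(coauthors, threshold=0.5):
    """Return (name_x, name_y, distance) for likely variants of one author."""
    # Step 1: blocking. Only names sharing a key are ever compared,
    # which avoids the quadratic comparison over all names.
    blocks = defaultdict(list)
    for name in coauthors:
        blocks[blocking_key(name)].append(name)

    # Step 2: measure the distance of two names via their coauthor sets.
    matches = []
    for names in blocks.values():
        for x, y in combinations(sorted(names), 2):
            d = jaccard_distance(coauthors[x], coauthors[y])
            if d <= threshold:
                matches.append((x, y, d))
    return matches


# Hypothetical records: author name -> set of coauthor names.
coauthors = {
    "Jane Q. Doe": {"A. Smith", "B. Jones", "C. Lee"},
    "J. Doe": {"A. Smith", "B. Jones"},
    "John Roe": {"A. Smith", "B. Jones"},
}
result = candidate_pairs(coauthors)
print(result)
```

In this toy data, "J. Doe" and "Jane Q. Doe" fall into the same block and share two of three coauthors, so they are reported as a candidate pair. "John Roe" has identical coauthors to "J. Doe" but lands in a different block and is never compared, which shows both the speed-up and the risk of missed matches that blocking introduces.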
Pages: 344-353
Number of pages: 10