An Unsupervised Heuristic-Based Hierarchical Method for Name Disambiguation in Bibliographic Citations

Times cited: 82
Authors
Cota, Ricardo G. [1 ]
Ferreira, Anderson A. [1 ]
Nascimento, Cristiano [1 ]
Goncalves, Marcos Andre [1 ]
Laender, Alberto H. F. [1 ]
Affiliations
[1] Univ Fed Minas Gerais, Dept Comp Sci, BR-31270010 Belo Horizonte, MG, Brazil
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2010 / Vol. 61 / Issue 09
Keywords
MODEL;
DOI
10.1002/asi.21363
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Name ambiguity in the context of bibliographic citations is a difficult problem which, despite the many efforts from the research community, still has a lot of room for improvement. In this article, we present a heuristic-based hierarchical clustering method to deal with this problem. The method successively fuses clusters of citations of similar author names based on several heuristics and similarity measures on the components of the citations (e.g., coauthor names, work title, and publication venue title). During the disambiguation task, the information about fused clusters is aggregated, providing more information for the next round of fusion. In order to demonstrate the effectiveness of our method, we ran a series of experiments on two different collections extracted from real-world digital libraries and compared it, under two metrics, with four representative methods described in the literature. We present comparisons of results using each considered attribute separately (i.e., coauthor names, work title, and publication venue title) with the author name attribute and using all attributes together. These results show that our unsupervised method, when using all attributes, performs competitively against all other methods, under both metrics, losing in only one case against a supervised method, whose result was very close to ours. Moreover, such results are achieved without the burden of any training and without using any privileged information such as knowing a priori the correct number of clusters.
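The fusion idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the name-compatibility rule (same last name and first initial), the two rounds (shared coauthor first, then Jaccard overlap of title or venue words), the 0.3 threshold, and all identifiers and sample citations below are assumptions made only for illustration. The point it shows is that each fused cluster pools its coauthors, title words, and venue words, so later rounds compare against richer evidence, and no target number of clusters is supplied.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A group of citations believed to belong to one author (hypothetical structure)."""
    names: set
    coauthors: set
    title_words: set
    venue_words: set
    members: list = field(default_factory=list)

    def absorb(self, other):
        # Aggregate the evidence of the fused cluster for later rounds.
        self.names |= other.names
        self.coauthors |= other.coauthors
        self.title_words |= other.title_words
        self.venue_words |= other.venue_words
        self.members += other.members


def names_compatible(a, b):
    # Assumed heuristic: same last name and same first initial.
    ta, tb = a.lower().split(), b.lower().split()
    return ta[-1] == tb[-1] and ta[0][0] == tb[0][0]


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def fuse_until_stable(clusters, should_fuse):
    # Repeatedly fuse the first qualifying pair until no pair qualifies.
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if should_fuse(clusters[i], clusters[j]):
                    clusters[i].absorb(clusters.pop(j))
                    changed = True
                    break
            if changed:
                break
    return clusters


def disambiguate(citations, threshold=0.3):
    # Start with one cluster per citation.
    clusters = [
        Cluster({c["author"]}, set(c["coauthors"]),
                set(c["title"].lower().split()),
                set(c["venue"].lower().split()), [c["id"]])
        for c in citations
    ]

    def compatible(a, b):
        return any(names_compatible(x, y) for x in a.names for y in b.names)

    # Round 1 (assumed): similar author names that share a coauthor.
    fuse_until_stable(
        clusters, lambda a, b: compatible(a, b) and bool(a.coauthors & b.coauthors))
    # Round 2 (assumed): similar author names with overlapping title or venue words.
    fuse_until_stable(
        clusters,
        lambda a, b: compatible(a, b) and (
            jaccard(a.title_words, b.title_words) >= threshold
            or jaccard(a.venue_words, b.venue_words) >= threshold))
    return [sorted(c.members) for c in clusters]


# Made-up citations, for illustration only.
citations = [
    {"id": 1, "author": "A. Gupta", "coauthors": ["R. Silva"],
     "title": "Query optimization in databases", "venue": "VLDB"},
    {"id": 2, "author": "Anil Gupta", "coauthors": ["R. Silva", "M. Lee"],
     "title": "Indexing spatial data", "venue": "SIGMOD"},
    {"id": 3, "author": "A. Gupta", "coauthors": ["M. Lee"],
     "title": "Distributed transactions", "venue": "ICDE"},
    {"id": 4, "author": "A. Gupta", "coauthors": ["J. Chen"],
     "title": "Protein folding simulation", "venue": "Bioinformatics"},
    {"id": 5, "author": "A. Gupta", "coauthors": ["K. Wong"],
     "title": "Protein folding energy models", "venue": "Bioinformatics"},
]
print(disambiguate(citations))
```

In this toy input, citations 1 to 3 are joined in the coauthor round, while citations 4 and 5 share no coauthor and are joined only in the second round through their overlapping title words.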
Pages: 1853-1870
Number of pages: 18