Author name disambiguation using a graph model with node splitting and merging based on bibliographic information

Cited by: 78
Authors
Shin, Dongwook [1]
Kim, Taehwan [1]
Choi, Joongmin [1]
Kim, Jungsun [1]
Affiliation
[1] Hanyang Univ, Dept Comp Sci & Engn, Ansan 426791, Gyeonggi Do, South Korea
Keywords
Author name disambiguation; Graph model; Namesake resolution; Heteronymous name resolution; Digital library; CITATIONS; WEB;
DOI
10.1007/s11192-014-1289-4
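Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
080201 [Mechanical manufacturing and automation];
Abstract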
Author ambiguity mainly arises when several different authors express their names in the same way, generally known as the namesake problem, and when the name of a single author is expressed in many different ways, referred to as the heteronymous name problem. These author ambiguity problems have long been an obstacle to efficient information retrieval in digital libraries, causing incorrect identification of authors and impeding correct classification of their publications. Distinguishing those authors is a nontrivial task, especially when very limited information about them is available. In this paper, we propose a graph-based approach to author name disambiguation, in which a graph model is constructed from co-author relations and author ambiguity is resolved by graph operations such as vertex (or node) splitting and merging based on co-authorship. In our framework, called the Graph Framework for Author Disambiguation (GFAD), the namesake problem is solved by splitting an author vertex involved in multiple cycles of co-authorship, and the heteronymous name problem is handled by merging multiple author vertices that have similar names if those vertices are connected to a common vertex. Experiments were carried out with the real DBLP and Arnetminer collections, and the performance of GFAD was compared with that of three representative unsupervised author name disambiguation systems. The results confirm that GFAD shows better overall performance on representative evaluation metrics. An additional contribution is that we have released the refined DBLP collection to the public to facilitate a performance benchmark for future author disambiguation systems.
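The merging rule described in the abstract (similar names connected to a common vertex) can be illustrated with a minimal Python sketch. This is not the GFAD implementation: the function names, the name-similarity rule (same last name and same first initial), and the sample author names are all assumptions made for illustration, and the cycle-based splitting step is not covered.

```python
from collections import defaultdict
from itertools import combinations


def names_compatible(a: str, b: str) -> bool:
    """Hypothetical similarity rule: same last name and same first initial."""
    a_parts = a.lower().replace(".", "").split()
    b_parts = b.lower().replace(".", "").split()
    return a_parts[-1] == b_parts[-1] and a_parts[0][0] == b_parts[0][0]


def build_coauthor_graph(papers):
    """Map each author name (vertex) to the set of names it co-occurs with."""
    graph = defaultdict(set)
    for authors in papers:
        for u, v in combinations(authors, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def merge_candidates(graph):
    """Return pairs of similar names that share at least one common coauthor."""
    pairs = []
    for u, v in combinations(sorted(graph), 2):
        if names_compatible(u, v) and graph[u] & graph[v]:
            pairs.append((u, v))
    return pairs


# Made-up author lists, one list per paper (illustrative only).
papers = [
    ["J. Smith", "A. Lee"],
    ["John Smith", "A. Lee"],
    ["Jane Smith", "B. Kumar"],
]
graph = build_coauthor_graph(papers)
print(merge_candidates(graph))
```

In this toy input, "J. Smith" and "John Smith" are flagged for merging because both are linked to "A. Lee", while "Jane Smith" is left alone because she shares no coauthor with either of them.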
Pages: 15-50
Number of pages: 36
Related papers
33 items in total
[1]
[Anonymous], 2007, ACM Transactions on Knowledge Discovery from Data (TKDD), DOI 10.1145/1217299.1217304
[2]
[Anonymous], 2011, Journal of Information and Data Management
[3]
[Anonymous], 2010, P 10 ANN JOINT C DIG, DOI 10.1145/1816123.1816130
[4]
[Anonymous], 2011, ACM J DATA INF QUAL, DOI 10.1145/1891879.1891883
[5]
Swoosh: a generic approach to entity resolution [J].
Benjelloun, Omar ;
Garcia-Molina, Hector ;
Menestrina, David ;
Su, Qi ;
Whang, Steven Euijong ;
Widom, Jennifer .
VLDB JOURNAL, 2009, 18 (01) :255-276
[6]
Bhattacharya Indrajit, 2006, P 6 SIAM INT C DAT M
[7]
Borgman CL, 1999, INFORM PROCESS MANAG, V35, P227, DOI 10.1016/S0306-4573(98)00059-4
[8]
Cherednichenko S., 2005, THESIS U JOENSUU
[9]
An Unsupervised Heuristic-Based Hierarchical Method for Name Disambiguation in Bibliographic Citations [J].
Cota, Ricardo G. ;
Ferreira, Anderson A. ;
Nascimento, Cristiano ;
Goncalves, Marcos Andre ;
Laender, Alberto H. F. .
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2010, 61 (09) :1853-1870
[10]
A Brief Survey of Automatic Methods for Author Name Disambiguation [J].
Ferreira, Anderson A. ;
Goncalves, Marcos Andre ;
Laender, Alberto H. F. .
SIGMOD RECORD, 2012, 41 (02) :15-26