Data sets for author name disambiguation: an empirical analysis and a new resource

Cited by: 51
Authors
Mueller, Mark-Christoph [1]
Reitz, Florian [2]
Roy, Nicolas [3]
Affiliations
[1] Heidelberg Inst Theoret Studies, Heidelberg, Germany
[2] DBLP, Trier, Germany
[3] FIZ Karlsruhe, Math Dept, Berlin, Germany
Keywords
Author name disambiguation; Author name homography; Author name variability; Data sets; Digital libraries
DOI
10.1007/s11192-017-2363-5
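Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
080201 [Mechanical manufacturing and automation]
Abstract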
Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.
Pages: 1467-1500
Number of pages: 34