Self-Training Author Name Disambiguation for Information Scarce Scenarios

Cited by: 35
Authors
Ferreira, Anderson A. [1]
Veloso, Adriano [2]
Goncalves, Marcos Andre [2]
Laender, Alberto H. F. [2]
Affiliations
[1] Univ Fed Ouro Preto, Dept Computacao, BR-35400000 Ouro Preto, MG, Brazil
[2] Univ Fed Minas Gerais, Dept Ciencia Computacao, BR-31270010 Belo Horizonte, MG, Brazil
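Keywords
MODEL
DOI
10.1002/asi.22992
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
080201 [Mechanical manufacturing and automation]
Abstract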
We present a novel 3-step self-training method for author name disambiguation, SAND (self-training associative name disambiguator), which requires no manual labeling and no parameterization (in real-world scenarios), and is particularly suitable for the common situation in which only the most basic information about a citation record is available (i.e., author names, and work and venue titles). In the first step, real-world heuristics on coauthors produce highly pure (although fragmented) clusters. In the second step, the most representative of these clusters are selected to serve as training data for the third step, supervised author assignment. The third step exploits a state-of-the-art transductive disambiguation method capable of detecting unseen authors not included in any training example and of incorporating reliable predictions into the training data. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation, demonstrate that our proposed method outperforms all representative unsupervised author grouping disambiguation methods and is very competitive with fully supervised author assignment methods. Thus, unlike other bootstrapping methods that exploit privileged, hard-to-obtain information such as self-citations and personal information, our proposed method delivers top-notch performance with no (manual) training data or parameterization, even when information is scarce.
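The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields (`coauthors`, `title`, `venue`), the function names, and the thresholds `min_size` and `min_overlap` are all hypothetical, cluster size stands in for "most representative", and a simple word-overlap score stands in for the associative classifier used in the paper.

```python
from collections import defaultdict


def coauthor_clusters(records):
    """Step 1 (sketch): merge records that share a coauthor name, via union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # coauthor name -> index of first record mentioning it
    for idx, rec in enumerate(records):
        for name in rec["coauthors"]:
            if name in first_seen:
                parent[find(idx)] = find(first_seen[name])
            else:
                first_seen[name] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return list(groups.values())


def tokens(rec):
    """Bag of lowercase words from the work title and venue title."""
    return set(rec["title"].lower().split()) | set(rec["venue"].lower().split())


def disambiguate(records, min_size=2, min_overlap=2):
    clusters = coauthor_clusters(records)
    # Step 2 (assumption): treat clusters with at least min_size records as training data.
    training = [c for c in clusters if len(c) >= min_size]
    pending = [i for c in clusters if len(c) < min_size for i in c]
    vocab = [set().union(*(tokens(records[i]) for i in c)) for c in training]
    # Step 3 (stand-in): assign each pending record to the best-matching cluster,
    # or open a new author when no cluster matches well enough.
    for idx in pending:
        words = tokens(records[idx])
        scores = [len(words & v) for v in vocab]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= min_overlap:
            training[best].append(idx)  # self-training: prediction joins training data
            vocab[best] |= words
        else:
            training.append([idx])  # unseen author
            vocab.append(set(words))
    return training


records = [
    {"coauthors": {"B. Lee"}, "title": "graph mining algorithms", "venue": "KDD"},
    {"coauthors": {"B. Lee", "C. Wu"}, "title": "scalable graph mining", "venue": "KDD"},
    {"coauthors": set(), "title": "graph mining streams", "venue": "KDD"},
    {"coauthors": {"D. Kim"}, "title": "protein folding simulation", "venue": "Bioinformatics"},
]
result = disambiguate(records)
```

In this toy input, the first two records are merged by their shared coauthor, the third joins them through title and venue overlap, and the fourth is treated as a previously unseen author.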
Pages: 1257-1278
Page count: 22