Dynamic author name disambiguation for growing digital libraries

Cited by: 41
Authors
Qian, Yanan [1 ]
Zheng, Qinghua [1 ]
Sakai, Tetsuya [2 ]
Ye, Junting [1 ]
Liu, Jun [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Peoples R China
[2] Waseda Univ, Dept Comp Sci & Engn, Tokyo, Japan
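Source
INFORMATION RETRIEVAL JOURNAL | 2015 / Vol. 18 / No. 05
Funding
National High Technology Research and Development Program of China (863 Program); US National Science Foundation;
Keywords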
Digital library; Author disambiguation; Data stream; Clustering; Multi-classification;
DOI
10.1007/s10791-015-9261-3
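Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
080201 [Mechanical Manufacturing and Automation];
Abstract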
When a digital library user searches for publications by an author name, she often sees a mixture of publications by different authors who have the same name. With the growth of digital libraries and involvement of more authors, this author ambiguity problem is becoming critical. Author disambiguation (AD) often tries to solve this problem by leveraging metadata such as coauthors, research topics, publication venues and citation information, since more personal information such as the contact details is often restricted or missing. In this paper, we study the problem of how to efficiently disambiguate author names given an incessant stream of published papers. To this end, we propose a "BatchAD+IncAD" framework for dynamic author disambiguation. First, we perform batch author disambiguation (BatchAD) to disambiguate all author names at a given time by grouping all records (each record refers to a paper with one of its author names) into disjoint clusters. This establishes a one-to-one mapping between the clusters and real-world authors. Then, for newly added papers, we periodically perform incremental author disambiguation (IncAD), which determines whether each new record can be assigned to an existing cluster, or to a new cluster not yet included in the previous data. Based on the new data, IncAD also tries to correct previous AD results. Our main contributions are: (1) We demonstrate with real data that a small number of new papers often have overlapping author names with a large portion of existing papers, so it is challenging for IncAD to effectively leverage previous AD results. (2) We propose a novel IncAD model which aggregates metadata from a cluster of records to estimate the author's profile such as her coauthor distributions and keyword distributions, in order to predict how likely it is that a new record is "produced" by the author. (3) Using two labeled datasets and one large-scale raw dataset, we show that the proposed method is much more efficient than state-of-the-art methods while ensuring high accuracy.
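The incremental step described in the abstract (assign each new record to an existing cluster, or open a new one) can be illustrated with a minimal Python sketch. This is not the paper's IncAD model: the abstract does not specify the scoring function, so a simple overlap heuristic stands in for the probabilistic estimate, and the step that corrects earlier results is omitted. The names `Cluster`, `overlap`, `score`, `assign`, the record layout, the equal weights, and the `threshold` value are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical author profile aggregated from the records in one cluster."""
    coauthors: Counter = field(default_factory=Counter)
    keywords: Counter = field(default_factory=Counter)
    size: int = 0

    def add(self, record):
        # Fold the record's metadata into the aggregated profile.
        self.coauthors.update(record["coauthors"])
        self.keywords.update(record["keywords"])
        self.size += 1


def overlap(items, profile):
    """Fraction of the record's items that already appear in the profile."""
    if not items:
        return 0.0
    return sum(1 for item in items if item in profile) / len(items)


def score(record, cluster):
    # Assumed stand-in for "how likely the record was produced by this author":
    # an equally weighted mix of coauthor overlap and keyword overlap.
    return (0.5 * overlap(record["coauthors"], cluster.coauthors)
            + 0.5 * overlap(record["keywords"], cluster.keywords))


def assign(record, clusters, threshold=0.3):
    """Put the record in the best-scoring cluster, or open a new cluster.

    Returns the index of the cluster that received the record.
    """
    best_idx, best_score = None, 0.0
    for idx, cluster in enumerate(clusters):
        s = score(record, cluster)
        if s > best_score:
            best_idx, best_score = idx, s
    if best_idx is None or best_score < threshold:
        # No existing profile explains the record well enough: new author.
        clusters.append(Cluster())
        best_idx = len(clusters) - 1
    clusters[best_idx].add(record)
    return best_idx
```

Calling `assign` on a stream of records that share one ambiguous author name grows the list of clusters only when a record matches no existing profile. A real system would replace `score` with a calibrated probability and would periodically revisit earlier assignments, as the abstract notes.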
Pages: 379-412
Page count: 34
Related papers
39 in total
[1]
A comparison of extrinsic clustering evaluation metrics based on formal constraints [J].
Amigo, Enrique ;
Gonzalo, Julio ;
Artiles, Javier ;
Verdejo, Felisa .
INFORMATION RETRIEVAL, 2009, 12 (04) :461-486
[2]
[Anonymous], INT SEM WEB C
[3]
[Anonymous], 2007, ACM Transactions on Knowledge Discovery from Data (TKDD), DOI 10.1145/1217299.1217304
[4]
[Anonymous], 2011, Journal of Information and Data Management
[5]
[Anonymous], 2008, P 17 ACM C INF KNOWL, DOI 10.1145/1458082.1458327
[6]
Resolving Person Names in Web People Search [J].
Balog, Krisztian ;
Azzopardi, Leif ;
de Rijke, Maarten .
WEAVING SERVICES AND PEOPLE ON THE WORLD WIDE WEB, 2009, :301-+
[7]
Bollen J., 2007, Proceedings of the 16th International Conference on World Wide Web, P1247
[8]
Byung-won O., 2007, SIAM INT C DAT MIN
[9]
Bootstrapping Active Name Disambiguation with Crowdsourcing [J].
Cheng, Yu ;
Chen, Zhengzhang ;
Wang, Jiang ;
Agrawal, Ankit ;
Choudhary, Alok .
PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, :1213-1216
[10]
Culotta A., 2007, WORKSH INF INT WEB W