Two supervised learning approaches for name disambiguation in author citations

Cited by: 168
Authors
Han, H [1]
Giles, L [1]
Zha, H [1]
Li, C [1]
Tsioutsiouliklis, K [1]
Affiliations
[1] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
Source
JCDL 2004: PROCEEDINGS OF THE FOURTH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES: GLOBAL REACH AND DIVERSE IMPACT | 2004
Keywords
naive Bayes; name disambiguation; Support Vector Machine;
DOI
10.1145/996350.996419
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper investigates two supervised learning approaches to disambiguate authors in citations. One approach uses the naive Bayes probability model, a generative model; the other uses Support Vector Machines (SVMs) [39] and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: co-author names, the title of the paper, and the title of the journal or proceeding. We illustrate these two approaches on two types of data, one collected from the web, mainly publication lists from homepages, the other collected from the DBLP citation databases.
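The generative approach named in the abstract can be illustrated with a minimal naive Bayes sketch over the three citation attributes. This is not the authors' implementation: the tokenization, the add-one smoothing, the function names (`train`, `predict`, `tokens`), and the toy citations with placeholder author and co-author names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# The three attribute types listed in the abstract.
ATTRIBUTES = ("coauthors", "title", "venue")

def tokens(citation, attr):
    """Split one attribute into tokens (hypothetical tokenization)."""
    if attr == "coauthors":
        return [name.strip().lower() for name in citation[attr]]
    return citation[attr].lower().split()

def train(labeled):
    """labeled: list of (author_id, citation dict). Returns count tables."""
    priors = Counter()              # citations seen per candidate author
    counts = defaultdict(Counter)   # (author, attribute) -> token counts
    vocab = defaultdict(set)        # attribute -> set of observed tokens
    for author, citation in labeled:
        priors[author] += 1
        for attr in ATTRIBUTES:
            for tok in tokens(citation, attr):
                counts[(author, attr)][tok] += 1
                vocab[attr].add(tok)
    return priors, counts, vocab

def predict(model, citation):
    """Return the candidate author with the highest log posterior."""
    priors, counts, vocab = model
    total = sum(priors.values())
    best, best_score = None, -math.inf
    for author in priors:
        score = math.log(priors[author] / total)
        for attr in ATTRIBUTES:
            c = counts[(author, attr)]
            # Add-one smoothing (an assumption), with one extra slot
            # reserved for tokens never seen in training.
            denom = sum(c.values()) + len(vocab[attr]) + 1
            for tok in tokens(citation, attr):
                score += math.log((c[tok] + 1) / denom)
        if score > best_score:
            best, best_score = author, score
    return best

# Toy, made-up citations for two distinct people sharing an ambiguous name.
training = [
    ("author_A", {"coauthors": ["X. Alpha", "Y. Beta"],
                  "title": "name disambiguation in digital libraries",
                  "venue": "digital libraries conference"}),
    ("author_A", {"coauthors": ["X. Alpha"],
                  "title": "citation indexing for digital libraries",
                  "venue": "digital libraries conference"}),
    ("author_B", {"coauthors": ["P. Gamma"],
                  "title": "protein folding simulation methods",
                  "venue": "bioinformatics journal"}),
    ("author_B", {"coauthors": ["P. Gamma", "Q. Delta"],
                  "title": "molecular simulation of protein structure",
                  "venue": "bioinformatics journal"}),
]

model = train(training)
query = {"coauthors": ["Y. Beta"],
         "title": "disambiguation of author names in citations",
         "venue": "digital libraries conference"}
print(predict(model, query))
```

The discriminative alternative in the abstract would instead map each citation to a vector over the same three attributes and train an SVM per candidate author; that variant is not sketched here.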
Pages: 296-305
Page count: 10
Related papers
41 in total
[1]  
[Anonymous], 1999, P 22 ANN INT ACM SIG
[2]  
[Anonymous], P 16 ANN INT ACM SIG
[3]  
Baker L. D., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P96, DOI 10.1145/290941.290970
[4]  
Banerjee A., 2003, ACM International Conference on Knowledge Discovery and Data Mining, SIGKDD, P19, DOI 10.1145/956750.956757
[5]  
BANERJEE S, P 3 INT C INT TEXT P
[6]  
Bar-Shalom Y., 1988, Tracking and Data Association
[7]   Adaptive name matching in information integration [J].
Bilenko, M ;
Mooney, R ;
Cohen, W ;
Ravikumar, P ;
Fienberg, S .
IEEE INTELLIGENT SYSTEMS, 2003, 18 (05) :16-23
[8]   DUPLICATE RECORD ELIMINATION IN LARGE DATA FILES [J].
BITTON, D ;
DEWITT, DJ .
ACM TRANSACTIONS ON DATABASE SYSTEMS, 1983, 8 (02) :255-265
[9]  
BRANTING LK, 2002, J INFORMATION LAW TE, P1
[10]  
Califf ME, 1999, SIXTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-99)/ELEVENTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE (IAAI-99), P328