A probabilistic similarity metric for Medline records: A model for author name disambiguation

Cited by: 104
Authors
Torvik, VI
Weeber, M
Swanson, DR
Smalheiser, NR
Affiliations
[1] Univ Illinois, Dept Psychiat, Chicago, IL 60612 USA
[2] Univ Chicago, Div Humanities, Chicago, IL 60637 USA
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2005 / Vol. 56 / No. 2
Funding
Japan Society for the Promotion of Science;
Keywords
DOI
10.1002/asi.20105
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
We present a model for estimating the probability that a pair of author names (sharing last name and first initial), appearing on two different Medline articles, refer to the same individual. The model uses a simple yet powerful similarity profile between a pair of articles, based on title, journal name, coauthor names, medical subject headings (MeSH), language, affiliation, and name attributes (prevalence in the literature, middle initial, and suffix). The similarity profile distribution is computed from reference sets consisting of pairs of articles containing almost exclusively author matches versus nonmatches, generated in an unbiased manner. Although the match set is generated automatically and might contain a small proportion of nonmatches, the model is quite robust against contamination with nonmatches. We have created a free, public service ("Author-ity": http://arrowsmith.psych.uic.edu) that takes as input an author's name given on a specific article, and gives as output a list of all articles with that (last name, first initial) ranked by decreasing similarity, with match probability indicated.
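The abstract's central idea, estimating a match probability from how often a similarity profile occurs in a reference set of matches versus a reference set of nonmatches, can be illustrated with a naive frequency-based Bayes estimate. This is a minimal sketch and not the authors' actual model: the two-feature profile `(shares_coauthor, same_journal)`, the `prior` of 0.5, the add-`alpha` smoothing, and the toy reference sets are all assumptions made for illustration.

```python
from collections import Counter


def match_probability(profile, match_set, nonmatch_set, prior=0.5, alpha=1.0):
    """Estimate P(match | profile) by Bayes' rule from two reference sets.

    profile: hashable tuple of similarity features for one article pair.
    match_set / nonmatch_set: lists of profiles from reference article pairs.
    prior: assumed prior probability of a match (hypothetical default).
    alpha: add-alpha smoothing so unseen profiles do not yield zero estimates.
    """
    match_counts = Counter(match_set)
    nonmatch_counts = Counter(nonmatch_set)
    # Distinct profiles seen in either reference set, plus the query profile.
    n_profiles = len(set(match_counts) | set(nonmatch_counts) | {profile})

    # Smoothed relative frequency of the profile in each reference set.
    p_given_match = (match_counts[profile] + alpha) / (
        len(match_set) + alpha * n_profiles
    )
    p_given_nonmatch = (nonmatch_counts[profile] + alpha) / (
        len(nonmatch_set) + alpha * n_profiles
    )

    numerator = prior * p_given_match
    return numerator / (numerator + (1.0 - prior) * p_given_nonmatch)


# Toy reference sets (invented data): profile = (shares_coauthor, same_journal).
toy_matches = [(1, 1)] * 8 + [(0, 1)] * 2
toy_nonmatches = [(0, 0)] * 9 + [(0, 1)] * 1

for candidate in [(1, 1), (0, 1), (0, 0)]:
    print(candidate, round(match_probability(candidate, toy_matches, toy_nonmatches), 3))
```

Ranking candidate articles by this estimate, highest first, would mirror the kind of output the abstract describes for the "Author-ity" service, although the service's real feature set and estimation procedure are not specified in this record.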
Pages: 140-158
Page count: 19