A network analysis model for disambiguation of names in lists

Cited by: 70
Authors
Malin B. [1, 2]
Airoldi E. [1]
Carley K.M. [2]
Affiliations
[1] Data Privacy Laboratory, Institute for Software Research International, Carnegie Mellon University, Pittsburgh
[2] Center for the Computational Analysis of Social and Organizational Systems, Institute for Software Research International, Carnegie Mellon University, Pittsburgh
Funding
Andrew W. Mellon Foundation (USA); National Science Foundation (USA)
Keywords
Clustering; Disambiguation; Link analysis; Random walks; Social networks;
DOI
10.1007/s10588-005-3940-3
Abstract
In research and application, social networks are increasingly extracted from relationships inferred by name collocations in text-based documents. Despite the fact that names represent real entities, names are not unique identifiers and it is often unclear when two name observations correspond to the same underlying entity. One confounder stems from ambiguity, in which the same name correctly references multiple entities. Prior name disambiguation methods measured similarity between two names as a function of their respective documents. In this paper, we propose an alternative similarity metric based on the probability of walking from one ambiguous name to another in a random walk of the social network constructed from all documents. We experimentally validate our model on actor-actor relationships derived from the Internet Movie Database. Using a global similarity threshold, we demonstrate random walks achieve a significant increase in disambiguation capability in comparison to prior models. © 2005 Springer Science + Business Media, Inc.
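The random-walk similarity described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each ambiguous name observation is a separate node, that names appearing in the same document are linked with a weight equal to their co-occurrence count, and that similarity is the probability that a walk starting at one observation reaches the other within a fixed number of steps. The function names, the `max_steps` parameter, and the toy documents are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(documents):
    """Build a weighted co-occurrence graph from lists of names.

    Assumption: two names are linked once for every document they share.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            weights[a][b] += 1.0
            weights[b][a] += 1.0
    return weights


def walk_probability(weights, start, target, max_steps=3):
    """Probability that a random walk from `start` hits `target`
    within `max_steps` steps (the target is treated as absorbing)."""
    if start == target:
        return 1.0
    current = {start: 1.0}  # probability mass not yet absorbed
    reached = 0.0
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for node, mass in current.items():
            neighbours = weights.get(node, {})
            total = sum(neighbours.values())
            if total == 0:
                continue  # isolated node: the walk cannot move on
            for nbr, w in neighbours.items():
                p = mass * w / total
                if nbr == target:
                    reached += p
                else:
                    nxt[nbr] += p
        current = nxt
    return reached


# Hypothetical toy data: three observations of the same ambiguous name,
# each kept as its own node until a similarity decision is made.
documents = [
    ["J. Smith (doc 1)", "Alice", "Bob"],
    ["J. Smith (doc 2)", "Alice"],
    ["J. Smith (doc 3)", "Carol"],
]
network = build_network(documents)
same_community = walk_probability(network, "J. Smith (doc 1)", "J. Smith (doc 2)")
different_community = walk_probability(network, "J. Smith (doc 1)", "J. Smith (doc 3)")
```

Under these assumptions, observations that share collaborators receive a higher score than observations in disconnected parts of the network. A global threshold on the score would then decide which observations are treated as the same entity.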
Pages: 119-139
Page count: 20