The strength of co-authorship in gene name disambiguation

Cited by: 29
Authors
Farkas, Richard [1]
Affiliation
[1] Hungarian Acad Sci, Res Grp Artificial Intelligence, Szeged, Hungary
Keywords
Mutual Author; Word Sense Disambiguation; Gene Identifier; Test Node; MedLine Abstract
DOI
10.1186/1471-2105-9-69
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every gene name alias identified in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine, through graph-based semi-supervised methods, how one special feature of biological articles, namely that the authors of each document are known, can be utilised for the GSD task.
Results: Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias and that this holds for the co-authors as well. To make use of the co-authorship information, we built the inverse co-author graph on MedLine abstracts. The nodes of the inverse co-author graph are articles, and there is an edge between two nodes if and only if the two articles have a mutual author. We introduce two methods that use graph-based distances between abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases with an extremely high (99.5%) precision rate using only information obtained from the inverse co-author graph. To attain full coverage, we incorporated the co-authorship information into two GSD systems; in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.
Conclusion: Based on the promising results obtained so far, we suggest that co-authorship information and the circumstances of an article's release (such as the title of the journal and the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles. The methods introduced here should therefore be useful for other biomedical natural language processing tasks as well (such as organism or target disease detection).
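The inverse co-author graph described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not specify the two distance-based methods, so the nearest-labelled-article rule below is an assumption made purely for illustration, as are the function names, article identifiers, author names and gene labels. The sketch builds a graph whose nodes are articles, joins two articles when they share an author, computes shortest-path distances with breadth-first search, and abstains when the graph gives no unambiguous answer.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_inverse_coauthor_graph(article_authors):
    """Nodes are articles; two articles are linked iff they share an author."""
    articles_by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            articles_by_author[author].add(article)
    graph = {article: set() for article in article_authors}
    for articles in articles_by_author.values():
        for u, v in combinations(sorted(articles), 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def graph_distances(graph, source):
    """Breadth-first search: hop distance from source to each reachable article."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def disambiguate(graph, test_article, labelled):
    """Assumed decision rule (hypothetical): take the gene of the closest
    labelled article; return None when nothing is reachable or the closest
    labelled articles disagree."""
    dist = graph_distances(graph, test_article)
    reachable = {a: d for a, d in dist.items()
                 if a in labelled and a != test_article}
    if not reachable:
        return None
    best = min(reachable.values())
    candidates = {labelled[a] for a, d in reachable.items() if d == best}
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical toy data: article IDs, author names and gene labels are invented.
article_authors = {
    "A1": ["Smith J", "Lee K"],
    "A2": ["Lee K", "Novak P"],
    "A3": ["Novak P"],
    "A4": ["Garcia M"],
}
labelled = {"A3": "gene_X", "A4": "gene_Y"}

graph = build_inverse_coauthor_graph(article_authors)
decision_a1 = disambiguate(graph, "A1", labelled)        # A3 is two hops away
decision_a4 = disambiguate(graph, "A4", {"A3": "gene_X"})  # A4 is isolated
```

In this toy setting the rule abstains for the isolated article, which mirrors the abstract's point that the graph alone yields a decision only for a subset of cases and that full coverage requires combining it with another GSD system.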
Pages: 8