The strength of co-authorship in gene name disambiguation

Cited by: 29

Authors:

Farkas, Richard [1]

Affiliations:

[1] Hungarian Acad Sci, Res Grp Artificial Intelligence, Szeged, Hungary

Source:

BMC Bioinformatics | 2008 / Vol. 9 / Issue 1

Keywords:

Mutual Author; Word Sense Disambiguation; Gene Identifier; Test Node; MedLine Abstract

DOI:

10.1186/1471-2105-9-69

Chinese Library Classification number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Background: Mentions of biomedical entities in articles and other free text are often ambiguous. For example, 13% of gene names (aliases) might refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine, through graph-based semi-supervised methods, how the GSD task can exploit one special feature of biological articles: the authors of the documents are known.

Results: Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias, and that this holds for the co-authors as well. To make use of the co-authorship information, we built the inverse co-author graph on MedLine abstracts. The nodes of the inverse co-author graph are articles, and there is an edge between two nodes if and only if the two articles have a mutual author. We introduce two methods that use graph-based distances between abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases with an extremely high precision (99.5%) using only information obtained from the inverse co-author graph. To attain full coverage, we incorporated the co-authorship information into two GSD systems; in experiments our procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.

Conclusion: Based on the promising results obtained so far, we suggest that co-authorship information and the circumstances of an article's release (such as the title of the journal and the year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles. Hence the methods introduced here should also be useful for other biomedical natural language processing tasks, such as organism or target disease detection.
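The inverse co-author graph defined in the Results (articles as nodes, an edge if and only if two articles share an author) can be illustrated with a minimal Python sketch. The article identifiers, author names and gene label below are invented for illustration. The abstract does not describe the two distance-based methods in detail, so `nearest_labelled_gene` is a hypothetical stand-in rule, not the paper's procedure: it takes the gene identifier of the closest labelled article and abstains when none is reachable or the closest ones disagree.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_inverse_coauthor_graph(article_authors):
    """Nodes are article ids; two articles are linked iff they share an author."""
    by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            by_author[author].add(article)

    graph = {article: set() for article in article_authors}
    for articles in by_author.values():
        for a, b in combinations(sorted(articles), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_distances(graph, source):
    """Breadth-first search: edge count from source to each reachable article."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def nearest_labelled_gene(graph, target, labels):
    """Hypothetical rule (an assumption, not taken from the paper): return the
    gene id of the closest labelled article, or None when no labelled article
    is reachable or the closest labelled articles disagree."""
    dist = graph_distances(graph, target)
    reachable = [(d, labels[a]) for a, d in dist.items()
                 if a in labels and a != target]
    if not reachable:
        return None
    best = min(d for d, _ in reachable)
    genes = {g for d, g in reachable if d == best}
    return genes.pop() if len(genes) == 1 else None


# Invented toy data: article id -> set of author names.
article_authors = {
    "A1": {"Smith", "Lee"},
    "A2": {"Lee", "Kovacs"},
    "A3": {"Kovacs"},
    "A4": {"Tanaka"},
}
graph = build_inverse_coauthor_graph(article_authors)
labels = {"A3": "GENE_X"}  # invented gene identifier for a labelled article

print(nearest_labelled_gene(graph, "A1", labels))
print(nearest_labelled_gene(graph, "A4", labels))
```

In this toy data, A1 reaches the labelled article A3 through A2, while A4 has no mutual author with any other article, so the rule abstains for it. Abstaining on unreachable articles is one way a graph-only method could cover only part of the cases and leave the rest to a fallback system.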

Pages: 8