Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy

Cited by: 21
Authors
Alexopoulou, Dimitra [1]
Andreopoulos, Bill [1]
Dietze, Heiko [1]
Doms, Andreas [1]
Gandon, Fabien [2]
Hakenberg, Joerg [1]
Khelif, Khaled [2]
Schroeder, Michael [1]
Waechter, Thomas [1]
Affiliations
[1] Tech Univ Dresden, Biotechnol Ctr BIOTEC, D-01062 Dresden, Germany
[2] INRIA Sophia Antipolis, F-06902 Sophia Antipolis, France
Source
BMC BIOINFORMATICS | 2009 / Vol. 10
Keywords
GENE ONTOLOGY; NAME DISAMBIGUATION; NATURAL-LANGUAGE; DOMAIN; SIMILARITY; ANNOTATION; TEXT
DOI
10.1186/1471-2105-10-28
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Metadata is another useful source of information for disambiguation. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.
Results: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path from co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms, including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but it does require training data, which the other methods do not. To evaluate these approaches, we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. Averaged over all approaches and conditions, the success rate is 80%. The 'MetaData' approach performed best, with 96%, when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves an 80% success rate on average.
Conclusion: Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but it does require a large, consistently modelled ontology, and these two requirements tend to conflict. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.
Availability: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
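
The 'Closest Sense' idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph, the term names, the functions `build_graph`, `path_length` and `closest_sense`, and the choice to score each sense by its mean path length to the co-occurring terms are all assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy ontology (NOT taken from GO or MeSH): undirected edges
# between made-up term identifiers, forming two disconnected branches.
TOY_EDGES = [
    ("development", "biological_process"),
    ("biological_process", "growth"),
    ("development", "embryo_development"),
    ("software_development", "engineering"),
    ("engineering", "programming"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal):
    """Shortest path length in edges via breadth-first search, or None."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Pick the candidate sense with the smallest mean path length to the
    co-occurring terms (assumed scoring rule); unreachable terms are skipped."""
    best_sense, best_score = None, None
    for sense in senses:
        dists = []
        for term in context_terms:
            d = path_length(graph, sense, term)
            if d is not None:
                dists.append(d)
        if not dists:
            continue  # this sense has no path to any context term
        score = sum(dists) / len(dists)
        if best_score is None or score < best_score:
            best_sense, best_score = sense, score
    return best_sense


graph = build_graph(TOY_EDGES)
senses = ["development", "software_development"]
print(closest_sense(graph, senses, ["growth", "embryo_development"]))
print(closest_sense(graph, senses, ["programming"]))
```

In this toy setting, a document mentioning growth-related terms resolves to the biological sense, while one mentioning programming resolves to the engineering sense. A real system would operate on actual ontology identifiers and may weight or aggregate paths differently.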
Pages: 15