Harmonization of gene/protein annotations: towards a gold standard MEDLINE

Cited by: 7
Authors
Campos, David [1]
Matos, Sergio [1]
Lewin, Ian [2]
Oliveira, Jose Luis [1]
Rebholz-Schuhmann, Dietrich [2]
Affiliations
[1] Univ Aveiro, IEETA DETI, P-3810193 Aveiro, Portugal
[2] European Bioinformat Inst, Cambridge CB10 1SD, England
Keywords
GENE; NAMES; INFORMATION; PROTEINS; DATABASE
DOI
10.1093/bioinformatics/bts125
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Named entity recognition (NER) is an elementary task in biomedical text mining. A number of NER solutions have been proposed in recent years, taking advantage of available annotated corpora, terminological resources and machine-learning techniques. Currently, the best-performing solutions combine the outputs of selected annotation solutions and are measured against a single corpus. However, little effort has been spent on systematically analysing methods that harmonize the annotation results and on measuring them against a combination of Gold Standard Corpora (GSCs).

Results: We present Totum, a machine-learning solution that harmonizes gene/protein annotations provided by heterogeneous NER solutions. It has been optimized and measured against a combination of manually curated GSCs. Our experiments show that the approach improves the F-measure of state-of-the-art solutions by up to 10% (reaching approximately 70%) in exact alignment and 22% (reaching approximately 82%) in nested alignment. We demonstrate that our solution delivers reliable annotation results across the GSCs and that it is an important contribution towards a homogeneous annotation of MEDLINE abstracts.
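The abstract reports F-measure under two matching criteria, exact alignment and nested alignment, without defining them here. As a rough illustration only, the sketch below assumes that an exact match requires identical span boundaries and that a nested match is satisfied when one span fully contains the other. These definitions, the `Span` type and all function names are assumptions made for this example and are not taken from the paper or from Totum.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def exact_match(pred: Span, gold: Span) -> bool:
    """Assumed exact alignment: both boundaries must coincide."""
    return pred == gold


def nested_match(pred: Span, gold: Span) -> bool:
    """Assumed nested alignment: one span fully contains the other."""
    pred_in_gold = gold[0] <= pred[0] and pred[1] <= gold[1]
    gold_in_pred = pred[0] <= gold[0] and gold[1] <= pred[1]
    return pred_in_gold or gold_in_pred


def f_measure(
    preds: List[Span],
    golds: List[Span],
    match: Callable[[Span, Span], bool],
) -> float:
    """Balanced F-measure (F1) of predicted spans against gold spans."""
    # Precision: share of predictions that match some gold span.
    matched_preds = sum(any(match(p, g) for g in golds) for p in preds)
    # Recall: share of gold spans that are matched by some prediction.
    matched_golds = sum(any(match(p, g) for p in preds) for g in golds)
    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, invented for illustration.
    golds = [(0, 5), (10, 20)]
    preds = [(0, 5), (12, 20), (30, 35)]
    print("exact :", f_measure(preds, golds, exact_match))
    print("nested:", f_measure(preds, golds, nested_match))
```

On the toy data the nested criterion credits the prediction `(12, 20)`, which lies inside the gold span `(10, 20)`, so the nested score is higher than the exact one. This is consistent with the direction of the gap between the two figures quoted in the abstract, although the evaluation protocol used in the paper may differ.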
Pages: 1253-1261
Page count: 9