A document processing pipeline for annotating chemical entities in scientific documents

Cited by: 14
Authors
Campos, David [1]
Matos, Sergio [2]
Oliveira, Jose L. [2]
Affiliations
[1] BMD Software Lda, Rua Calouste Gulbenkian 1, P-3810074 Aveiro, Portugal
[2] Univ Aveiro, DETI IEETA, P-3810193 Aveiro, Portugal
Source
JOURNAL OF CHEMINFORMATICS | 2015 / Vol. 7
Keywords
DISCOVERY; DATABASE; DRUGS
DOI
10.1186/1758-2946-7-S1-S7
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Background: The recognition of drugs and chemical entities in text is a very important task within the field of biomedical information extraction, given the rapid growth in the amount of published texts (scientific papers, patents, patient records) and the relevance of these and other related concepts. If done effectively, this could allow exploiting such textual resources to automatically extract or infer relevant information, such as drug profiles, relations and similarities between drugs, or associations between drugs and potential drug targets. The objective of this work was to develop and validate a document processing and information extraction pipeline for the identification of chemical entity mentions in text.

Results: We used the BioCreative IV CHEMDNER task data to train and evaluate a machine-learning based entity recognition system. Using a combination of two conditional random field models, a selected set of features, and a post-processing stage, we achieved F-measure results of 87.48% in the chemical entity mention recognition task and 87.75% in the chemical document indexing task.

Conclusions: We present a machine learning-based solution for automatic recognition of chemical and drug names in scientific documents. The proposed approach applies a rich feature set, including linguistic, orthographic, morphological, dictionary matching and local context features. Post-processing modules are also integrated, performing parentheses correction, abbreviation resolution and filtering erroneous mentions using an exclusion list derived from the training data. The developed methods were implemented as a document annotation tool and web service, freely available at http://bioinformatics.ua.pt/becas-chemicals/.
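As an illustration of the post-processing stage described in the abstract, the following is a minimal Python sketch of two of its steps: parentheses correction and exclusion-list filtering. It is not the authors' implementation. The `Mention` type, the function names, the one-character repair strategy and the case-insensitive matching are all assumptions made for this example; the abstract does not specify how either step is carried out, and abbreviation resolution is omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical mention type: character offsets into the document (end exclusive).
@dataclass(frozen=True)
class Mention:
    start: int
    end: int
    text: str

_PAIRS = {"(": ")", "[": "]", "{": "}"}
_CLOSERS = set(_PAIRS.values())


def is_balanced(text: str) -> bool:
    """Return True if every bracket in text is properly opened and closed."""
    stack = []
    for ch in text:
        if ch in _PAIRS:
            stack.append(_PAIRS[ch])
        elif ch in _CLOSERS:
            if not stack or stack.pop() != ch:
                return False
    return not stack


def fix_parentheses(doc: str, m: Mention) -> Optional[Mention]:
    """Assumed repair strategy: trim one dangling bracket at an edge, or
    extend the span by one character to absorb an adjacent bracket.
    Returns None when no single-character change balances the mention."""
    if is_balanced(m.text):
        return m
    candidates = []
    if m.text and m.text[0] in _PAIRS:          # drop a leading opener
        candidates.append((m.start + 1, m.end))
    if m.text and m.text[-1] in _CLOSERS:       # drop a trailing closer
        candidates.append((m.start, m.end - 1))
    if m.end < len(doc) and doc[m.end] in _CLOSERS:      # absorb next closer
        candidates.append((m.start, m.end + 1))
    if m.start > 0 and doc[m.start - 1] in _PAIRS:       # absorb previous opener
        candidates.append((m.start - 1, m.end))
    for s, e in candidates:
        if s < e and is_balanced(doc[s:e]):
            return Mention(s, e, doc[s:e])
    return None


def postprocess(doc: str, mentions: Iterable[Mention],
                exclusion: Iterable[str]) -> List[Mention]:
    """Repair brackets, then drop mentions found in the exclusion list
    (case-insensitive matching is an assumption of this sketch)."""
    excluded = {term.lower() for term in exclusion}
    kept = []
    for m in mentions:
        fixed = fix_parentheses(doc, m)
        if fixed is None or fixed.text.lower() in excluded:
            continue
        kept.append(fixed)
    return kept
```

In this sketch a mention that cannot be balanced by a one-character change is discarded rather than kept, which favours precision over recall; a real system might choose differently.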
Pages: 10