tmChem: a high performance approach for chemical named entity recognition and normalization

Cited by: 245
Authors
Leaman, Robert [1]
Wei, Chih-Hsuan [1]
Lu, Zhiyong [1]
Affiliations
[1] Natl Ctr Biotechnol Informat, 8600 Rockville Pike, Bethesda, MD 20894 USA
Keywords
Marginal Probability; Conditional Random Field; Allopregnanolone; Entity Recognition; Binary Feature
DOI
10.1186/1758-2946-7-S1-S3
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
070301 [Inorganic Chemistry]
Abstract
Chemical compounds and drugs are an important class of entities in biomedical research, with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying chemical mentions, their properties, and their relationships as discussed in the literature. We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and a micro-averaged f-measure of 0.8745 on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high-recall variant achieved the highest recall on both the CEM and CDI subtasks. We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition has now tied (or exceeded) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance. The source code and trained models for both tmChem models are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem. The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
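
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of a micro-averaged f-measure at the mention level. It is not the CHEMDNER evaluation script: the function name `micro_f_measure`, the representation of mentions as `(start, end)` character spans, the exact-match criterion, and the toy data are all assumptions made for illustration. Micro-averaging pools true positives, false positives, and false negatives over all documents before computing precision and recall.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged precision, recall and f-measure.

    gold and predicted map a document id to a set of (start, end)
    mention spans. A predicted span counts as correct only if it
    matches a gold span exactly (an assumption of this sketch).
    """
    tp = fp = fn = 0
    # Pool the counts over every document before computing any ratio.
    for doc_id in gold.keys() | predicted.keys():
        gold_spans = gold.get(doc_id, set())
        pred_spans = predicted.get(doc_id, set())
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy annotations, not taken from the CHEMDNER corpus.
gold = {"doc1": {(0, 7), (15, 22)}, "doc2": {(3, 9)}}
predicted = {"doc1": {(0, 7), (30, 35)}, "doc2": {(3, 9)}}

precision, recall, f_measure = micro_f_measure(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

An abstract-level score in the style of the CDI subtask could be sketched with the same function by replacing each span set with the set of unique mention strings per document, again as an assumption about the evaluation rather than a description of the official scorer.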
Pages: 10