Boosting drug named entity recognition using an aggregate classifier

Cited by: 40
Authors
Korkontzelos, Ioannis [1]
Piliouras, Dimitrios [1]
Dowsey, Andrew W. [2, 3, 4]
Ananiadou, Sophia [1]
Affiliations
[1] Univ Manchester, Manchester Inst Biotechnol, Natl Ctr Text Min NaCTeM, Sch Comp Sci, Manchester M1 7DN, Lancs, England
[2] Univ Manchester, Ctr Endocrinol & Diabet, Inst Human Dev, Fac Med & Human Sci, Manchester, Lancs, England
[3] Univ Manchester, CADET, Manchester M13 9WL, Lancs, England
[4] Cent Manchester Univ Hosp NHS Fdn Trust, Manchester Acad Hlth Sci Ctr, Manchester M13 9WL, Lancs, England
Funding
Engineering and Physical Sciences Research Council (UK);
Keywords
Named entity annotation sparsity; Gold-standard vs. silver-standard annotations; Named entity recogniser aggregation; Genetic-programming-evolved string-similarity patterns; Drug named entity recognition; DICTIONARY;
DOI
10.1016/j.artmed.2015.05.007
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high-quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system to combine a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employed genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and used them as an extra means for recognition. Materials: Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. Results: Aggregating drug NER methods, based on gold-standard annotations, dictionary knowledge and patterns, improved on the performance of models trained on gold-standard annotations only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns is shown to achieve performance comparable to that of models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models built on dictionary knowledge can achieve classification performance comparable to that of the best performing model trained on gold-standard annotations. © 2015 The Authors. Published by Elsevier B.V.
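The aggregation idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy dictionary stands in for DrugBank, the suffix pattern is hand-written and is not one of the 11 genetic-programming-evolved patterns, the stub model stands in for a trained classifier, and the vote-counting rule is a simple threshold rather than the voting system the authors built.

```python
import re

# Hypothetical toy dictionary standing in for DrugBank drug names.
DRUG_DICTIONARY = {"aspirin", "warfarin", "midazolam"}

# Illustrative hand-written suffix pattern; NOT one of the paper's evolved patterns.
SUFFIX_PATTERN = re.compile(r"^[a-z]+(?:azole|mycin|pril|statin|olol)$", re.IGNORECASE)


def dictionary_tagger(tokens):
    """Flag tokens that appear in the (toy) drug dictionary."""
    return [tok.lower() in DRUG_DICTIONARY for tok in tokens]


def suffix_tagger(tokens):
    """Flag tokens whose ending matches a common drug suffix."""
    return [bool(SUFFIX_PATTERN.match(tok)) for tok in tokens]


def stub_model_tagger(tokens):
    """Placeholder for a trained classifier; it only knows a fixed set of names."""
    known = {"ketoconazole", "midazolam"}
    return [tok.lower() in known for tok in tokens]


def aggregate(tokens, taggers, min_votes=1):
    """Label a token as a drug when at least `min_votes` taggers flag it."""
    votes = [tagger(tokens) for tagger in taggers]
    return [sum(column) >= min_votes for column in zip(*votes)]


def drug_tokens(tokens, flags):
    """Return the tokens whose flag is True."""
    return [tok for tok, flag in zip(tokens, flags) if flag]


TAGGERS = [dictionary_tagger, suffix_tagger, stub_model_tagger]
tokens = "Ketoconazole increased midazolam exposure but not atorvastatin clearance".split()

# Union of all recognisers (favours recall).
union = drug_tokens(tokens, aggregate(tokens, TAGGERS, min_votes=1))
# Majority of the three recognisers (favours precision).
majority = drug_tokens(tokens, aggregate(tokens, TAGGERS, min_votes=2))

print(union)
print(majority)
```

In this toy run the suffix pattern is the only recogniser that flags "atorvastatin", so lowering `min_votes` trades precision for recall, which is the role the abstract assigns to the suffix patterns.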
Pages: 145-153
Number of pages: 9