Boosting drug named entity recognition using an aggregate classifier

Cited by: 40

Authors:

Korkontzelos, Ioannis [1]

Piliouras, Dimitrios [1]

Dowsey, Andrew W. [2,3,4]

Ananiadou, Sophia [1]

Affiliations:

[1] Univ Manchester, Manchester Inst Biotechnol, Natl Ctr Text Min NaCTeM, Sch Comp Sci, Manchester M1 7DN, Lancs, England

[2] Univ Manchester, Ctr Endocrinol & Diabet, Inst Human Dev, Fac Med & Human Sci, Manchester, Lancs, England

[3] Univ Manchester, CADET, Manchester M13 9WL, Lancs, England

[4] Cent Manchester Univ Hosp NHS Fdn Trust, Manchester Acad Hlth Sci Ctr, Manchester M13 9WL, Lancs, England

Source:

ARTIFICIAL INTELLIGENCE IN MEDICINE | 2015 / Vol. 65 / Issue 02

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

Named entity annotation sparsity; Gold-standard vs. silver-standard annotations; Named entity recogniser aggregation; Genetic-programming-evolved string-similarity patterns; Drug named entity recognition; DICTIONARY;

DOI:

10.1016/j.artmed.2015.05.007

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Objective: Drug named entity recognition (NER) is a critical step for complex biomedical NLP tasks such as the extraction of pharmacogenomic, pharmacodynamic and pharmacokinetic parameters. Large quantities of high-quality training data are almost always a prerequisite for employing supervised machine-learning techniques to achieve high classification performance. However, the human labour needed to produce and maintain such resources is a significant limitation. In this study, we improve the performance of drug NER without relying exclusively on manual annotations. Methods: We perform drug NER using either a small gold-standard corpus (120 abstracts) or no corpus at all. In our approach, we develop a voting system to combine a number of heterogeneous models, based on dictionary knowledge, gold-standard corpora and silver annotations, to enhance performance. To improve recall, we employ genetic programming to evolve 11 regular-expression patterns that capture common drug suffixes and use them as an extra means of recognition. Materials: Our approach uses a dictionary of drug names, i.e. DrugBank, a small manually annotated corpus, i.e. the pharmacokinetic corpus, and a part of the UKPMC database, as raw biomedical text. Gold-standard and silver annotated data are used to train maximum entropy and multinomial logistic regression classifiers. Results: Aggregating drug NER methods based on gold-standard annotations, dictionary knowledge and patterns improved performance over models trained on gold-standard annotations only, achieving a maximum F-score of 95%. In addition, combining models trained on silver annotations, dictionary knowledge and patterns is shown to achieve performance comparable to that of models trained exclusively on gold-standard data. The main reason appears to be the morphological similarities shared among drug names. Conclusion: We conclude that gold-standard data are not a hard requirement for drug NER. Combining heterogeneous models built on dictionary knowledge can achieve classification performance similar or comparable to that of the best-performing model trained on gold-standard annotations. © 2015 The Authors. Published by Elsevier B.V.
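The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the voting rule, so simple majority and union votes are assumed here; the suffix pattern, the toy dictionary, and the stub classifier are hypothetical stand-ins for the 11 evolved patterns, DrugBank, and the trained models respectively.

```python
import re

# Hypothetical suffix pattern: a few well-known drug-name endings chosen for
# illustration. These are NOT the 11 patterns evolved in the paper.
SUFFIX_PATTERN = re.compile(r"^[a-z]+(?:mycin|azole|pril|statin|olol)$", re.IGNORECASE)

# Hypothetical toy dictionary standing in for a resource such as DrugBank.
TOY_DICTIONARY = {"aspirin", "warfarin", "ketoconazole"}


def dictionary_tagger(tokens):
    """Flag tokens found in the (toy) drug dictionary."""
    return [tok.lower() in TOY_DICTIONARY for tok in tokens]


def pattern_tagger(tokens):
    """Flag tokens whose ending matches the (illustrative) suffix pattern."""
    return [bool(SUFFIX_PATTERN.match(tok)) for tok in tokens]


def stub_model_tagger(tokens):
    """Placeholder for a trained classifier; it simply flags a fixed token set."""
    known = {"warfarin", "midazolam"}
    return [tok.lower() in known for tok in tokens]


def majority_vote(tokens, taggers):
    """Label a token as a drug if more than half of the taggers flag it."""
    votes = [tagger(tokens) for tagger in taggers]
    threshold = len(taggers) / 2
    return [sum(column) > threshold for column in zip(*votes)]


def union_vote(tokens, taggers):
    """Label a token as a drug if any tagger flags it (favours recall)."""
    votes = [tagger(tokens) for tagger in taggers]
    return [any(column) for column in zip(*votes)]


if __name__ == "__main__":
    sentence = ["Ketoconazole", "inhibits", "midazolam", "and", "warfarin"]
    taggers = [dictionary_tagger, pattern_tagger, stub_model_tagger]
    print(majority_vote(sentence, taggers))
    print(union_vote(sentence, taggers))
```

In this sketch the union vote trades precision for recall, which mirrors the abstract's stated motivation for adding suffix patterns, while the majority vote requires agreement between at least two of the three sources.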


Pages: 145-153

Page count: 9
