A study of active learning methods for named entity recognition in clinical text

Cited by: 83
Authors
Chen, Yukun [1]
Lasko, Thomas A. [1]
Mei, Qiaozhu [3,4]
Denny, Joshua C. [1,2]
Xu, Hua [1,5]
Affiliations
[1] Vanderbilt Univ, Sch Med, Dept Biomed Informat, Nashville, TN 37212 USA
[2] Vanderbilt Univ, Sch Med, Dept Med, Nashville, TN 37212 USA
[3] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA
[4] Univ Michigan, Dept Elect Engn & Comp Sci, Ann Arbor, MI 48109 USA
[5] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, Houston, TX 77030 USA
Funding
US National Institutes of Health;
Keywords
Active learning; Machine learning; Clinical natural language processing; Clinical named entity recognition; ELECTRONIC HEALTH RECORDS; MEDICATION INFORMATION; PRE-ANNOTATION; EXTRACTION; CLASSIFICATION; ASSERTIONS;
DOI
10.1016/j.jbi.2015.09.010
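Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
080201 [Mechanical manufacturing and automation];
Abstract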
Objectives: Named entity recognition (NER), a sequential labeling task, is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance, but they often require large numbers of annotated samples, which are expensive to build because annotation requires domain experts. Active learning (AL), a sample selection approach integrated with supervised ML, aims to minimize annotation cost while maximizing the performance of ML-based models. In this study, our goal was to develop and evaluate both existing and new AL methods for a clinical NER task: identifying concepts of medical problems, treatments, and lab tests in clinical notes.
Methods: Using the annotated NER corpus from the 2010 i2b2/VA NLP challenge, which contains 349 clinical documents with 20,423 unique sentences, we simulated AL experiments using a number of existing and novel algorithms in three categories: uncertainty-based, diversity-based, and baseline sampling strategies. They were compared with passive learning, which uses random sampling. To evaluate the active and passive learning methods, we generated learning curves that plot NER model performance against the estimated annotation cost (based on the number of sentences or words in the training set) and computed the area under the learning curve (ALC) score.
Results: Based on the learning curves of F-measure vs. number of sentences, uncertainty sampling algorithms outperformed all other methods in ALC. Most diversity-based methods also performed better than random sampling in ALC. To achieve an F-measure of 0.80, the best uncertainty sampling method could save 66% of annotations, counted in sentences, compared with random sampling. For the learning curves of F-measure vs. number of words, uncertainty sampling methods again outperformed all other methods in ALC. To achieve an F-measure of 0.80, the best uncertainty-based method saved 42% of annotations, counted in words, compared with random sampling, whereas the best diversity-based method reduced annotation effort by only 7%.
Conclusion: In the simulated setting, AL methods, particularly approaches based on uncertainty sampling, appeared to reduce annotation cost significantly for the clinical NER task. The actual benefit of active learning in clinical NER should be further evaluated in a real-time setting. © 2015 Elsevier Inc. All rights reserved.
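The evaluation described in the abstract (selecting uncertain sentences, scoring a learning curve by its area, and comparing the annotation cost needed to reach a target F-measure) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, least confidence is only one possible uncertainty measure, the ALC is computed with the trapezoidal rule and normalized by the cost range, and the curves in the demo are made-up numbers rather than reported results.

```python
from typing import List, Optional, Sequence


def least_confidence_select(confidences: Sequence[float], batch_size: int) -> List[int]:
    """Pick the sentences the model is least sure about.

    confidences[i] is assumed to be the model's probability for its best
    label sequence on unlabeled sentence i (hypothetical input format).
    """
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:batch_size]


def area_under_learning_curve(costs: Sequence[float], scores: Sequence[float]) -> float:
    """Trapezoidal area under (cost, score) points, divided by the cost range.

    The normalization is an assumption; it keeps the result on the same
    scale as the score itself.
    """
    if len(costs) != len(scores) or len(costs) < 2:
        raise ValueError("need at least two matching (cost, score) points")
    area = 0.0
    for i in range(1, len(costs)):
        width = costs[i] - costs[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area / (costs[-1] - costs[0])


def cost_to_reach(costs: Sequence[float], scores: Sequence[float], target: float) -> Optional[float]:
    """Smallest cost at which the linearly interpolated curve reaches target."""
    if scores[0] >= target:
        return costs[0]
    for i in range(1, len(costs)):
        if scores[i] >= target:
            # Every earlier point is below target, so the denominator is positive.
            frac = (target - scores[i - 1]) / (scores[i] - scores[i - 1])
            return costs[i - 1] + frac * (costs[i] - costs[i - 1])
    return None  # the curve never reaches the target


def percent_saved(active_cost: float, passive_cost: float) -> float:
    """Annotation saved by active learning relative to passive learning, in percent."""
    return (1.0 - active_cost / passive_cost) * 100.0


if __name__ == "__main__":
    # Made-up curves for illustration only: cost in sentences, score in F-measure.
    costs = [0, 1000, 2000, 4000]
    random_f = [0.50, 0.65, 0.72, 0.80]
    uncertain_f = [0.50, 0.74, 0.80, 0.83]

    print("ALC random:     ", area_under_learning_curve(costs, random_f))
    print("ALC uncertainty:", area_under_learning_curve(costs, uncertain_f))

    c_rand = cost_to_reach(costs, random_f, 0.80)
    c_unc = cost_to_reach(costs, uncertain_f, 0.80)
    if c_rand is not None and c_unc is not None:
        print("saved (%):", percent_saved(c_unc, c_rand))
```

The same functions apply unchanged when cost is counted in words instead of sentences; only the values passed as costs differ.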
Pages: 11-18
Page count: 8