Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries

Cited by: 60
Authors
Xu, Yan [1, 2]
Hong, Kai [1, 3]
Tsujii, Junichi [1]
Chang, Eric I-Chao [1]
Affiliations
[1] Microsoft Res Asia, Beijing 100080, Peoples R China
[2] Beihang Univ, Minist Educ, State Key Lab Software Dev Environm, Key Lab Biomech & Mechanobiol, Beijing, Peoples R China
[3] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
Funding
US National Science Foundation
DOI
10.1136/amiajnl-2011-000776
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: A system that translates narrative text in the medical domain into a structured representation is in great demand. The system performs three sub-tasks: concept extraction, assertion classification, and relation identification.
Design: The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts, using a dosage-unit dictionary to switch dynamically between two models based on Conditional Random Fields (CRF), (4) classifying assertions by a vote of five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features.
Measurements: Macro-averaged and micro-averaged precision, recall and F-measure were used to evaluate results.
Results: The performance is competitive with state-of-the-art systems, with a micro-averaged F-measure of 0.8489 for concept extraction, 0.9392 for assertion classification and 0.7326 for relation identification.
Conclusions: The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our systems. In concept extraction, we demonstrated that switching between models, one of which is specially designed for telegraphic sentences, significantly improved extraction of the treatment concept. In assertion classification, a set of features derived from a rule-based classifier proved effective for classes such as conditional and possible, which would otherwise suffer from data scarcity in conventional machine-learning methods. In relation identification, we use a two-stage architecture, the second stage of which applies pairwise classifiers to possible candidate classes. This architecture significantly improves performance.
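The abstract reports both macro-averaged and micro-averaged precision, recall and F-measure. As a rough illustration of how the two averages differ, here is a minimal Python sketch for a single-label classification task such as assertion classification. The function names, the toy labels, and the choice to compute macro F as the mean of per-class F-scores are assumptions made for this example; the abstract does not specify the exact definitions used in the paper.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def averaged_scores(gold, pred):
    """Return (macro, micro) tuples of (precision, recall, F-measure).

    gold and pred are equal-length lists of class labels.
    Assumption: macro F is the unweighted mean of the per-class F-scores.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    labels = sorted(set(gold) | set(pred))
    per_class = [prf(tp[c], fp[c], fn[c]) for c in labels]
    # Macro: average the per-class scores, so every class counts equally.
    macro = tuple(sum(s[i] for s in per_class) / len(labels) for i in range(3))
    # Micro: pool the counts over all classes, so every instance counts equally.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: one frequent class and one rare class.
gold = ["present"] * 8 + ["possible"] * 2
pred = ["present"] * 8 + ["present", "possible"]
macro, micro = averaged_scores(gold, pred)
print("macro P/R/F:", ", ".join(f"{x:.4f}" for x in macro))
print("micro P/R/F:", ", ".join(f"{x:.4f}" for x in micro))
```

In this toy data the rare class is recognized only half of the time, which pulls the macro-averaged F-measure below the micro-averaged one. This is why reporting both is informative when classes such as conditional and possible are scarce.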
Pages: 824-832
Page count: 9