Ontology-guided feature engineering for clinical text classification

Cited by: 88
Authors
Garla, Vijay N. [1]
Brandt, Cynthia [2, 3]
Affiliations
[1] Yale Univ, Interdept Program Computat Biol & Bioinformat, New Haven, CT 06520 USA
[2] Connecticut VA Healthcare Syst, West Haven, CT 06516 USA
[3] Yale Univ, Yale Ctr Med Informat, New Haven, CT 06520 USA
Keywords
Natural language processing; Semantic similarity; Feature selection; Kernel methods; Information gain; Information content; Extraction; Selection
DOI
10.1016/j.jbi.2012.04.010
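Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
080201 [Mechanical manufacturing and automation];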
Abstract
In this study we present novel feature engineering techniques that leverage the biomedical domain knowledge encoded in the Unified Medical Language System (UMLS) to improve machine-learning-based clinical text classification. Critical steps in clinical text classification include identifying the features and passages relevant to the classification task, and representing clinical text so that documents of different classes can be discriminated. We developed novel information-theoretic techniques that use the taxonomical structure of the UMLS to improve feature ranking, and we developed a semantic similarity measure that projects clinical text into a feature space that improves classification. We evaluated these methods on the 2008 Integrating Informatics with Biology and the Bedside (I2B2) obesity challenge. The methods we developed improve upon the results of this challenge's top machine-learning-based system, and may improve the performance of other machine-learning-based clinical text classification systems. We have released all tools developed as part of this study as open source, available at http://code.google.com/p/ytex. © 2012 Elsevier Inc. All rights reserved.
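As a rough illustration of the feature-ranking step mentioned in the abstract, the sketch below ranks terms by plain information gain over a toy corpus. It is not the authors' method: it ignores the UMLS taxonomy entirely, and the function names (`entropy`, `information_gain`, `rank_features`) and the four toy documents are assumptions made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(has_feature, labels):
    """Reduction in label entropy after splitting on a binary feature."""
    present = [y for f, y in zip(has_feature, labels) if f]
    absent = [y for f, y in zip(has_feature, labels) if not f]
    total = len(labels)
    conditional = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - conditional


def rank_features(doc_features, labels):
    """Return (term, information gain) pairs, highest gain first."""
    vocab = sorted(set().union(*doc_features))
    scores = {
        term: information_gain([term in doc for doc in doc_features], labels)
        for term in vocab
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpus: each document is a set of terms with a class label.
docs = [{"obesity", "diabetes"}, {"obesity"}, {"asthma"}, {"asthma", "diabetes"}]
labels = ["Y", "Y", "N", "N"]
ranking = rank_features(docs, labels)
```

In this toy corpus, "obesity" and "asthma" each separate the two classes perfectly, so they rank above "diabetes", which appears equally often in both classes. A taxonomy-aware variant, such as the one the abstract describes, would additionally draw on relationships between concepts; that part is not sketched here.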
Pages: 992-998
Page count: 7