Five-way smoking status classification using text hot-spot identification and error-correcting output codes

Cited by: 25
Author
Cohen, Aaron M. [1]
Affiliation
[1] Oregon Hlth & Sci Univ, Sch Med, Dept Med Informat & Clin Epidemiol, Portland, OR 97239 USA
DOI
10.1197/jamia.M2434
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-averaged F1 as the primary performance metric. Our best-performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best-performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error-correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.
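To make the error-correcting output code idea in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes an exhaustive code matrix for five classes (15 binary sub-problems) and nearest-codeword decoding by Hamming distance; the record does not say which code matrix or decoding rule the author used. The class labels are assumed to be those of the i2b2 challenge and are not listed in this record. The helper names `exhaustive_code`, `hamming`, and `decode` are hypothetical, and the "classifier outputs" are simulated by flipping bits in a true codeword.

```python
from itertools import product

# Assumed class labels (not stated in this record).
CLASSES = ["CURRENT SMOKER", "PAST SMOKER", "SMOKER", "NON-SMOKER", "UNKNOWN"]


def exhaustive_code(k):
    """Build k codewords of length 2**(k-1) - 1 (an exhaustive code).

    Each column defines one binary sub-problem. The first class is always
    labeled 1, which avoids complementary columns; the all-ones column is
    skipped because it would not separate any classes.
    """
    columns = []
    for rest in product((0, 1), repeat=k - 1):
        if all(rest):
            continue  # constant column, useless as a binary task
        columns.append((1,) + rest)
    # Transpose: row i is the codeword for class i.
    return [tuple(col[i] for col in columns) for i in range(k)]


def hamming(a, b):
    """Number of positions at which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits, codewords):
    """Return the index of the codeword closest to the observed bits."""
    return min(range(len(codewords)), key=lambda i: hamming(bits, codewords[i]))


CODEWORDS = exhaustive_code(len(CLASSES))

# Simulate 15 binary classifiers voting on a NON-SMOKER document,
# with three of them giving the wrong answer.
true_class = CLASSES.index("NON-SMOKER")
noisy = list(CODEWORDS[true_class])
for j in (0, 5, 11):
    noisy[j] ^= 1

predicted = decode(noisy, CODEWORDS)
print(CLASSES[predicted])
```

In this sketch every pair of codewords differs in 8 positions, so up to three wrong binary decisions can still be decoded to the correct class. That redundancy is what distinguishes an output code from a plain one-vs-rest setup.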
Pages: 32-35
Page count: 4