Applying active learning to high-throughput phenotyping algorithms for electronic health records data

Cited by: 70
Authors
Chen, Yukun [1]
Carroll, Robert J. [1]
Hinz, Eugenia R. McPeek [2,3]
Shah, Anushi [1]
Eyler, Anne E. [4]
Denny, Joshua C. [1,4]
Xu, Hua [1,5]
Affiliations
[1] Vanderbilt Univ, Sch Med, Dept Biomed Informat, Nashville, TN 37212 USA
[2] Duke Univ, Med Ctr, Dept Med, Durham, NC 27710 USA
[3] Duke Univ, Med Ctr, Dept Pediat, Durham, NC 27710 USA
[4] Vanderbilt Univ, Sch Med, Dept Med, Nashville, TN 37212 USA
[5] Univ Texas Hlth Sci Ctr Houston, Sch Biomed Informat, Houston, TX 77030 USA
Keywords
RANDOMIZED CONTROLLED-TRIAL; MEDICAL-RECORDS; RHEUMATOID-ARTHRITIS; EXTRACTION SYSTEM; INFORMATION; DISCOVERY; LIBRARY; RISK
DOI
10.1136/amiajnl-2013-001945
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objectives: Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health records data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms.

Methods: We integrated an uncertainty sampling AL approach with support vector machines-based phenotyping algorithms and evaluated its performance using three annotated disease cohorts including rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which contained at least all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. The performance of the AL was compared with a passive learning (PL) approach based on random sampling.

Results: Our evaluation showed that AL outperformed PL on three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. AL also achieved a reduction of 68% for VTE with an optimal AUC of 0.70 using refined features. As expected, refined features improved the performance of phenotyping classifiers and required fewer annotated samples.

Conclusions: This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.
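
The contrast between uncertainty sampling (AL) and random sampling (PL) described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record gives no code, so every function name and parameter here is hypothetical, and a small linear SVM trained by Pegasos-style subgradient descent stands in for whatever SVM solver the study used. Feature vectors are plain lists of floats and labels are +1 (case) or -1 (control).

```python
import random


def decision(w, b, x):
    """Signed score of sample x under the linear model (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent on the hinge loss (assumed stand-in solver)."""
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            violated = y[i] * decision(w, b, X[i]) < 1.0
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink from the L2 penalty
            if violated:  # margin violated: step toward the sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def most_uncertain(w, b, X, pool):
    """Uncertainty sampling: the pool index whose score is closest to the boundary."""
    return min(pool, key=lambda i: abs(decision(w, b, X[i])))


def run_learner(X, y, seed_idx, n_queries, strategy="active", seed=0):
    """Grow the labeled set one sample per round; y[pick] plays the reviewer's role."""
    rng = random.Random(seed)
    labeled = list(seed_idx)
    pool = [i for i in range(len(X)) if i not in set(labeled)]
    for _ in range(n_queries):
        if not pool:
            break
        w, b = train_linear_svm([X[i] for i in labeled], [y[i] for i in labeled])
        if strategy == "active":
            pick = most_uncertain(w, b, X, pool)  # AL: query the least certain sample
        else:
            pick = rng.choice(pool)  # PL: query a random sample
        pool.remove(pick)
        labeled.append(pick)
    model = train_linear_svm([X[i] for i in labeled], [y[i] for i in labeled])
    return labeled, model
```

Calling `run_learner` once with `strategy="active"` and once with `strategy="passive"` on the same seed set, then scoring both models on a held-out set at each labeled-set size, would give learning curves of the kind the abstract compares.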
Pages: E253-E259
Page count: 7