Learning statistical models of phenotypes using noisy labeled training data

Cited by: 92
Authors
Agarwal, Vibhu [1]
Podchiyska, Tanya [1]
Banda, Juan M. [2]
Goel, Veena [3, 4]
Leung, Tiffany I. [5]
Minty, Evan P. [1, 6]
Sweeney, Timothy E. [1, 7]
Gyang, Elsie [8]
Shah, Nigam H. [2]
Affiliations
[1] Stanford Univ, Biomed Informat Training Program, Med Sch Off Bldg,1265 Welch Rd, Stanford, CA 94305 USA
[2] Stanford Univ, Stanford Ctr Biomed Informat Res, Stanford, CA 94305 USA
[3] Stanford Univ, Dept Pediat, Sch Med, Stanford, CA 94305 USA
[4] Stanford Childrens Hlth, Dept Clin Informat, Stanford, CA 94305 USA
[5] Stanford Univ, Div Gen Med Disciplines, Stanford, CA 94305 USA
[6] Univ Calgary, Fac Med, Calgary, AB T2N 4N1, Canada
[7] Stanford Hosp & Clin, Dept Surg, Stanford, CA 94305 USA
[8] Stanford Hosp & Clin, Div Vasc Surg, Stanford, CA 94305 USA
Funding
National Institutes of Health (USA);
Keywords
Electronic health record; phenotyping; noisy labels; machine learning; high throughput; ELECTRONIC HEALTH RECORDS; MEDICAL-RECORDS; EMERGE NETWORK; ALGORITHMS; ASSOCIATION; CHALLENGES;
DOI
10.1093/jamia/ocw028
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record.
Methods: We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1 penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard.
Results: Our models achieve precision and accuracy of 0.90 and 0.89 for Type 2 diabetes mellitus, and 0.86 and 0.89 for myocardial infarction. Local implementations of the previously validated rule-based definitions achieve precision and accuracy of 0.96 and 0.92 for Type 2 diabetes mellitus, and 0.84 and 0.87 for myocardial infarction. We have demonstrated the feasibility of learning phenotype models using imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach.
Conclusions: Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.
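The keyword-based labeling step and the two reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword list, the toy notes, the gold labels, and the function names `noisy_label` and `precision_and_accuracy` are all hypothetical, and the L1 penalized logistic regression itself is omitted. The sketch only shows how a keyword match produces imperfect labels and how precision and accuracy would be computed against a gold standard.

```python
import re

# Hypothetical keyword list for the phenotype of interest (not from the paper).
KEYWORDS = ["type 2 diabetes", "t2dm"]


def noisy_label(note, keywords=KEYWORDS):
    """Return 1 if any keyword appears as a whole phrase in the note, else 0."""
    text = note.lower()
    return int(any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords))


def precision_and_accuracy(predicted, gold):
    """Compute precision and accuracy of binary labels against gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = correct / len(gold) if gold else 0.0
    return precision, accuracy


# Toy notes and toy gold labels, invented purely for illustration.
notes = [
    "Patient with type 2 diabetes on metformin.",        # true case, keyword hit
    "Family history of T2DM; patient is not diabetic.",  # keyword hit, not a case
    "Admitted for chest pain, troponin negative.",       # no hit, not a case
    "Elevated HbA1c, started on insulin.",               # true case, no keyword
]
gold = [1, 0, 0, 1]

labels = [noisy_label(n) for n in notes]
precision, accuracy = precision_and_accuracy(labels, gold)
print(labels, precision, accuracy)
```

The second and fourth notes show the two kinds of label noise a keyword match introduces: a mention that does not indicate the phenotype, and a true case that never uses the keyword. In the approach described in the abstract, such labels serve as training data for a model rather than as final predictions.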
Pages: 1166-1173
Page count: 8