Bootstrapping a de-identification system for narrative patient records: Cost-performance tradeoffs

Cited by: 20
Authors
Hanauer, David [1]
Aberdeen, John [2]
Bayer, Samuel [2]
Wellner, Benjamin [2]
Clark, Cheryl [2]
Zheng, Kai [3, 4]
Hirschman, Lynette [2]
Affiliations
[1] Univ Michigan, Dept Pediat, Ann Arbor, MI 48109 USA
[2] Mitre Corp, Bedford, MA 01730 USA
[3] Univ Michigan, Dept Hlth Management & Policy, Sch Publ Hlth, Ann Arbor, MI 48109 USA
[4] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA
Funding
National Institutes of Health (USA)
Keywords
Privacy [I01.880.604.473.352.500]; Natural language processing [L01.224.065.580]; NLP; Electronic health records [E05.318.308.940.968.625.500]; Medical record systems, Computerized [E05.318.308.940.968.625]; Medical informatics [L01.313.500]; HEALTH INFORMATION-TECHNOLOGY; OF-THE-ART; FREE-TEXT; CLINICAL TEXT
DOI
10.1016/j.ijmedinf.2013.03.005
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: We describe an experiment to build a de-identification system for clinical records using the open source MITRE Identification Scrubber Toolkit (MIST). We quantify the human annotation effort needed to produce a system that de-identifies at high accuracy.
Methods: Using two types of clinical records (history and physical notes, and social work notes), we iteratively built statistical de-identification models by annotating 10 notes, training a model, applying the model to another 10 notes, correcting the model's output, and training from the resulting larger set of annotated notes. This was repeated for 20 rounds of 10 notes each, and then an additional 6 rounds of 20 notes each, and a final round of 40 notes. At each stage, we measured precision, recall, and F-score, and compared these to the amount of annotation time needed to complete the round.
Results: After the initial 10-note round (33 min of annotation time) we achieved an F-score of 0.89. After just over 8 h of annotation time (round 21) we achieved an F-score of 0.95. The number of annotation actions needed, as well as the time needed, decreased in later rounds as model performance improved. Accuracy on history and physical notes exceeded that of social work notes, suggesting that the wider variety of contexts for protected health information (PHI) in social work notes is more difficult to model.
Conclusions: It is possible, with modest effort, to build a functioning de-identification system de novo using the MIST framework. The resulting system achieved performance comparable to other high-performing de-identification systems. (C) 2013 Elsevier Ireland Ltd. All rights reserved.
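The round structure and the evaluation metric described in the Methods can be sketched as follows. This is only an illustration under stated assumptions: the schedule is read literally from the abstract, the F-score is assumed to be the balanced F1 (the abstract does not say which variant was used), and the function names are hypothetical and not part of MIST.

```python
from itertools import accumulate

# Round schedule as stated in the abstract: (number of rounds, notes per round).
SCHEDULE = [(20, 10), (6, 20), (1, 40)]


def notes_per_round(schedule=SCHEDULE):
    """Expand the schedule into a flat list with one entry per round."""
    return [size for count, size in schedule for _ in range(count)]


def cumulative_notes(schedule=SCHEDULE):
    """Running total of annotated notes after each round."""
    return list(accumulate(notes_per_round(schedule)))


def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall).

    Assumption: the abstract reports "F-score" without naming the variant;
    F1 is used here for illustration only.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sizes = notes_per_round()
    totals = cumulative_notes()
    print(f"rounds: {len(sizes)}, notes annotated in total: {totals[-1]}")
```

Under this literal reading, the schedule expands to one list entry per round, which makes it straightforward to pair each round's cumulative note count with its measured annotation time and F-score.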
Pages: 821-831
Number of pages: 11