The MITRE Identification Scrubber Toolkit: Design, training, and assessment

Cited by: 90
Authors
Aberdeen, John [1]
Bayer, Samuel [1]
Yeniterzi, Reyyan [2]
Wellner, Ben [1]
Clark, Cheryl [1]
Hanauer, David [3]
Malin, Bradley [4]
Hirschman, Lynette [1]
Affiliations
[1] MITRE Corp, Bedford, MA 01730 USA
[2] Carnegie Mellon Univ, Sch Comp Sci, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Univ Michigan, Ctr Comprehens Canc, Ann Arbor, MI 48109 USA
[4] Vanderbilt Univ, Sch Med, Dept Biomed Informat, Nashville, TN 37212 USA
Keywords
Privacy; De-identification; Natural language processing; Electronic health records; OF-THE-ART; INFORMATION; PRIVACY
DOI
10.1016/j.ijmedinf.2010.09.007
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose: Medical records must often be stripped of patient identifiers, or de-identified, before being shared. De-identification by humans is time-consuming, and existing software is limited in its generality. The open source MITRE Identification Scrubber Toolkit (MIST) provides an environment to support rapid tailoring of automated de-identification to different document types, using automatically learned classifiers to de-identify and protect sensitive information.
Methods: MIST was evaluated with four classes of patient records from the Vanderbilt University Medical Center: discharge summaries, laboratory reports, letters, and order summaries. We trained and tested MIST on each class of record separately, as well as on pooled sets of records. We measured precision, recall, F-measure and accuracy at the word level for the detection of patient identifiers as designated by the HIPAA Safe Harbor Rule.
Results: MIST was applied to medical records that differed in the amounts and types of protected health information (PHI): lab reports contained only two types of PHI (dates, names) compared to discharge summaries, which were much richer. Performance of the de-identification tool depended on record class; F-measure results were 0.996 for order summaries, 0.996 for discharge summaries, 0.943 for letters and 0.934 for laboratory reports. Experiments suggest the tool requires several hundred training exemplars to reach an F-measure of at least 0.9.
Conclusions: The MIST toolkit makes possible the rapid tailoring of automated de-identification to particular document types and supports the transition of the de-identification software to medical end users, avoiding the need for developers to have access to original medical records. We are making the MIST toolkit available under an open source license to encourage its application to diverse data sets at multiple institutions.
© 2010 Elsevier Ireland Ltd. All rights reserved.
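The word-level measures named in the abstract (precision, recall, F-measure, accuracy) can be illustrated with a minimal sketch. This is not the MIST scorer: the function name `word_level_scores`, the binary PHI/non-PHI labelling of each word, and the use of the balanced F-measure (harmonic mean of precision and recall) are assumptions made for illustration only, and the example labels are invented.

```python
from typing import Dict, Sequence


def word_level_scores(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Score per-word PHI decisions; True means the word is tagged as PHI.

    Hypothetical helper for illustration, not part of MIST.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # PHI word correctly tagged
    fp = sum(1 for g, p in pairs if not g and p)      # non-PHI word tagged as PHI
    fn = sum(1 for g, p in pairs if g and not p)      # PHI word missed
    tn = len(pairs) - tp - fp - fn                    # non-PHI word left alone

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F-measure (assumed): harmonic mean of precision and recall.
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Invented eight-word example: one PHI word missed, one false alarm.
    gold = [False, True, True, False, False, True, False, False]
    pred = [False, True, False, False, True, True, False, False]
    print(word_level_scores(gold, pred))
```

In de-identification a missed identifier (a false negative) usually matters more than a false alarm, so recall is often read alongside the F-measure rather than folded into it.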
Pages: 849-859
Page count: 11