De-identification of health records using Anonym: Effectiveness and robustness across datasets

Cited by: 18
Authors
Zuccon, Guido [1, 2]
Kotzur, Daniel [1]
Nguyen, Anthony [1]
Bergheim, Anton [3]
Affiliations
[1] Royal Brisbane & Womens Hosp, Commonwealth Sci & Ind Res Org, Australian E Hlth Res Ctr, Herston, Qld 4029, Australia
[2] Queensland Univ Technol, Sch Informat Syst, Brisbane, Qld, Australia
[3] Canc Inst NSW, Australian Technol Pk, Eveleigh, NSW 2015, Australia
Keywords
Conditional random fields; Pattern matching; De-identification; Health records; SYSTEM;
DOI
10.1016/j.artmed.2014.03.006
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Objective: Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness.
Methods and materials: The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and quality, with one of the datasets containing optical character recognition errors.
Results: Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training.
Conclusion: Findings show that Anonym is comparable to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations in training size, data type and quality, provided sufficient training data is available.
Crown Copyright (C) 2014 Published by Elsevier B.V. All rights reserved.
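The recall and precision figures quoted in the abstract can be read as follows: recall is the share of true identifiers that the tool finds, and precision is the share of flagged items that really are identifiers. The sketch below is illustrative only and is not taken from the paper. It assumes exact-match scoring over hypothetical (start, end, label) spans, which may differ from the evaluation protocol the authors used, and the sample spans are invented.

```python
def precision_recall(gold, predicted):
    """Score predicted identifier spans against gold annotations.

    Both arguments are sets of (start, end, label) tuples. A prediction
    counts as correct only if it matches a gold span exactly; this
    exact-match rule is an assumption made for this sketch.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations for one record (not real data).
gold = {
    (0, 10, "PATIENT"),
    (25, 35, "DATE"),
    (50, 62, "HOSPITAL"),
    (70, 78, "PHONE"),
    (85, 90, "ID"),
}
predicted = {
    (0, 10, "PATIENT"),
    (25, 35, "DATE"),
    (50, 62, "HOSPITAL"),
    (100, 105, "DATE"),  # spurious detection: lowers precision
}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four flagged spans are correct and three of the five gold identifiers are found. For de-identification, missed identifiers (low recall) are usually the costlier error, since they leave personal information in the released text.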
Pages: 145-151
Page count: 7