BoB, a best-of-breed automated text de-identification system for VHA clinical documents

Cited by: 45

Authors:

Ferrandez, Oscar ^{[1,2]}

South, Brett R. ^{[1,2]}

Shen, Shuying ^{[1,2]}

Friedlin, F. Jeffrey ^{[3]}

Samore, Matthew H. ^{[1,2]}

Meystre, Stephane M. ^{[1,2]}

Affiliations:

[1] Univ Utah, Dept Biomed Informat, Salt Lake City, UT 84112 USA

[2] SLCVA Healthcare Syst, IDEAS Ctr, Salt Lake City, UT USA

[3] Regenstrief Inst Inc, Med Informat, Indianapolis, IN USA

Source:

JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION | 2013 / Volume 20 / Issue 01

Keywords:

THE-ART; LIBRARY;

DOI:

10.1136/amiajnl-2012-001020

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Objective: De-identification allows faster and more collaborative clinical research while protecting patient confidentiality. Clinical narrative de-identification is a tedious process that can be alleviated by automated natural language processing methods. The goal of this research is the development of an automated text de-identification system for Veterans Health Administration (VHA) clinical documents.

Materials and methods: We devised a novel stepwise hybrid approach designed to improve the current strategies used for text de-identification. The proposed system is based on a previous study on the best de-identification methods for VHA documents. This best-of-breed automated clinical text de-identification system (aka BoB) tackles the problem as two separate tasks: (1) maximize patient confidentiality by redacting as much protected health information (PHI) as possible; and (2) leave de-identified documents in a usable state, preserving as much clinical information as possible.

Results: We evaluated BoB with a manually annotated corpus of a variety of VHA clinical notes, as well as with the 2006 i2b2 de-identification challenge corpus. We present evaluations at the instance and token levels, with detailed results for BoB's main components. Moreover, an existing text de-identification system was also included in our evaluation.

Discussion: BoB's design efficiently takes advantage of the methods implemented in its pipeline, resulting in high sensitivity values (especially for sensitive PHI categories) and a limited number of false positives.

Conclusions: Our system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.
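The abstract reports evaluations at both the instance and the token level. The following is a minimal sketch of how the two granularities can differ. It is not taken from the paper: the exact-match criterion for instances, the whitespace tokenization, the example sentence, and all function names are assumptions made for illustration only.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def instance_counts(gold: List[Span], pred: List[Span]) -> Tuple[int, int, int]:
    """Return (tp, fp, fn), counting a prediction as correct only on an exact span match (assumption)."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    return tp, len(pred_set) - tp, len(gold_set) - tp


def token_offsets(text: str) -> List[Span]:
    """Whitespace tokenization (assumption); returns the character span of each token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def covered_tokens(tokens: List[Span], spans: List[Span]) -> Set[int]:
    """Indices of tokens that overlap at least one annotated span."""
    return {
        i
        for i, (t_start, t_end) in enumerate(tokens)
        if any(t_start < s_end and s_start < t_end for s_start, s_end in spans)
    }


def token_counts(text: str, gold: List[Span], pred: List[Span]) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) over tokens rather than whole annotations."""
    tokens = token_offsets(text)
    gold_tok = covered_tokens(tokens, gold)
    pred_tok = covered_tokens(tokens, pred)
    tp = len(gold_tok & pred_tok)
    return tp, len(pred_tok) - tp, len(gold_tok) - tp


def sensitivity(tp: int, fn: int) -> float:
    """Recall: share of gold PHI that was found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical example: the gold standard marks a full name and a date,
# while the system output marks only the first name and the date.
text = "Seen by Dr. John Smith on 03/12/2010."
gold = [(12, 22), (26, 36)]  # "John Smith", "03/12/2010"
pred = [(12, 16), (26, 36)]  # "John", "03/12/2010"

inst_tp, inst_fp, inst_fn = instance_counts(gold, pred)
tok_tp, tok_fp, tok_fn = token_counts(text, gold, pred)
instance_sensitivity = sensitivity(inst_tp, inst_fn)
token_sensitivity = sensitivity(tok_tp, tok_fn)
```

In this toy case the partially redacted name counts as a full miss at the instance level but as one hit and one miss at the token level, which is one reason a paper might report both views.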


Pages: 77-83

Page count: 7
