De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields

Cited by: 28

Authors:

Dalianis, Hercules [1]

Velupillai, Sumithra [1]

Affiliations:

[1] Stockholm Univ Forum, DSV, Dept Comp & Syst Sci, S-16440 Kista, Sweden

Source:

JOURNAL OF BIOMEDICAL SEMANTICS | 2010 / Volume 1

Keywords:

Conditional Random Field; Partial Match; Named Entity Recognition; Protected Health Information; Annotation Class

DOI:

10.1186/2041-1480-1-6

Chinese Library Classification:

Q [Biological Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Background: Research on the information contained in Electronic Patient Records (EPRs) requires access to the data itself, which is often very difficult to obtain because of confidentiality regulations. Data sets must be fully de-identified before they can be distributed to researchers. De-identification is a difficult task in which the definitions of annotation classes are not self-evident.

Results: We present work on the creation of two refined variants of a manually annotated Gold Standard for de-identification: one created automatically and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. Both variants are used to train and evaluate an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of around 4,000-6,000 annotation instances, we obtained very promising results for both Gold Standards: an F-score of around 0.80 for a number of experiments, with higher results for certain annotation classes. Moreover, 49 instances counted as false positives were verified to be true positives that the system found but the annotators had missed.

Conclusions: We intend to make this Gold Standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus Gold Standard is the most valuable for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.
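The evaluation setup named in the abstract (four-fold cross-validation scored by F-score) can be illustrated with a minimal sketch. It is not the authors' code: the helper names `k_fold_indices` and `f_score`, the round-robin fold assignment, and the counts in the usage lines are assumptions chosen for illustration, and the F-score shown is the balanced F1 measure.

```python
from typing import List, Tuple


def k_fold_indices(n_items: int, k: int = 4) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits over items 0..n_items-1.

    Hypothetical helper: items are dealt round-robin into k folds, and each
    fold serves once as the held-out test set.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score (F1) from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only; these are not figures from the paper.
splits = k_fold_indices(10, k=4)
example_f = f_score(tp=8, fp=2, fn=2)
```

In a real experiment, a sequence labeller would be trained on each `train` split and its predicted annotations compared with the Gold Standard on the matching `test` split to obtain the counts passed to `f_score`.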

Pages: 10
