Building a semantically annotated corpus of clinical texts

Cited by: 93
Authors
Roberts, Angus [1]
Gaizauskas, Robert [1]
Hepple, Mark [1]
Demetriou, George [1]
Guo, Yikun [1]
Roberts, Ian [1]
Setzer, Andrea [1]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England
Funding
UK Medical Research Council;
Keywords
Corpora; Semantic annotation; Clinical text; Natural language processing; Gold standards; Evaluation; Information extraction; Text mining; Temporal annotation; Annotation guidelines;
DOI
10.1016/j.jbi.2008.12.013
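Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
080201 [Mechanical manufacturing and automation];
Abstract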
In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, whose value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains. (C) 2009 Elsevier Inc. All rights reserved.
Pages: 950-966
Page count: 17