The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes

Cited by: 166
Authors
Vincze, Veronika [1]
Szarvas, Gyoergy [1]
Farkas, Richard [2]
Mora, Gyoergy [1]
Csirik, Janos [2]
Affiliations
[1] Univ Szeged, Dept Informat, Human Language Technol Grp, Szeged, Hungary
[2] Hungarian Acad Sci, Res Grp Artificial Intelligence, Szeged, Hungary
DOI
10.1186/1471-2105-9-S11-S9
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Detecting uncertain and negative assertions is essential in most biomedical text mining tasks, where the general aim is to derive factual knowledge from textual data. This article reports on a corpus annotation project that has produced a freely available resource, the BioScope corpus, for research on handling negation and uncertainty in biomedical texts.
Results: The corpus consists of three parts: medical free texts, biological full papers and biological scientific abstracts. The dataset contains annotations at the token level for negative and speculative keywords and at the sentence level for their linguistic scope. Annotation was carried out by two independent linguist annotators and a chief linguist, who also set up the annotation guidelines and resolved cases where the annotators disagreed. The resulting corpus consists of more than 20,000 sentences that were considered for annotation, and over 10% of them contain one or more linguistic annotations suggesting negation or uncertainty.
Conclusion: Statistics are reported on corpus size, ambiguity levels and the consistency of annotations. The corpus is available free of charge for academic purposes. Apart from its intended role as a common resource for training, testing and comparing biomedical Natural Language Processing systems, the corpus is also a good resource for the linguistic analysis of scientific and clinical texts.
Pages: 9