Towards comprehensive syntactic and semantic annotations of the clinical narrative

Cited by: 70
Authors
Albright, Daniel [1]
Lanfranchi, Arrick [1]
Fredriksen, Anwen [1]
Styler, William F. [1]
Warner, Colin [2]
Hwang, Jena D. [1]
Choi, Jinho D. [3]
Dligach, Dmitriy [4, 5]
Nielsen, Rodney D. [1, 6]
Martin, James [3]
Ward, Wayne [3]
Palmer, Martha [1]
Savova, Guergana K. [4, 5]
Affiliations
[1] Univ Colorado, Dept Linguist, Boulder, CO 80309 USA
[2] Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA
[3] Univ Colorado, Dept Comp Sci, Boulder, CO 80309 USA
[4] Boston Childrens Hosp, Dept Pediat, Boston, MA USA
[5] Harvard Univ, Sch Med, Boston, MA 02114 USA
[6] Univ N Texas, Dept Comp Sci & Engn, Denton, TX 76203 USA
Keywords
Gold Standard Annotations; UMLS; Treebank; Propbank; Natural Language Processing; cTAKES; CORPUS; TEXT;
DOI
10.1136/amiajnl-2012-001317
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components.
Methods: Manual annotation of a clinical narrative corpus of 127606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed.
Results: The final corpus consists of 13091 sentences containing 1772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891-0.931), NE (0.697-0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations.
Conclusions: This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.
Pages: 922-930
Page count: 9