Anaphoric reference in clinical reports: Characteristics of an annotated corpus

Cited by: 11
Authors
Chapman, Wendy W. [1]
Savova, Guergana K. [2, 3]
Zheng, Jiaping [4]
Tharp, Melissa [1]
Crowley, Rebecca [5]
Affiliations
[1] Univ Calif San Diego, Div Biomed Informat, La Jolla, CA 92093 USA
[2] Childrens Hosp, Boston, MA 02114 USA
[3] Harvard Univ, Sch Med, Boston, MA 02114 USA
[4] Univ Massachusetts, Amherst, MA 01003 USA
[5] Univ Pittsburgh, Dept Biomed Informat, Pittsburgh, PA 15260 USA
Keywords
Natural language processing; Clinical reports; INFORMATION; EXTRACTION; SYSTEM
DOI
10.1016/j.jbi.2012.01.010
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Motivation: Expressions that refer to a real-world entity already mentioned in a narrative are often considered anaphoric. For example, in the sentence "The pain comes and goes," the expression "the pain" is probably referring to a previous mention of pain. Interpretation of meaning involves resolving the anaphoric reference: deciding which expression in the text is the correct antecedent of the referring expression, also called an anaphor. We annotated a set of 180 clinical reports (surgical pathology, radiology, discharge summaries, and emergency department) from two institutions to indicate all anaphor-antecedent pairs.
Objective: The objective of this study is to describe the characteristics of the corpus in terms of the frequency of anaphoric relations, the syntactic and semantic nature of the members of the pairs, and the types of anaphoric relations that occur. Understanding how anaphoric reference is exhibited in clinical reports is critical to developing reference resolution algorithms and to identifying peculiarities of clinical text that may alter the features and methodologies that will be successful for automated anaphora resolution.
Results: We found that anaphoric reference is prevalent in all types of clinical reports, that annotations of noun phrases, semantic type, and section headings may be especially important for automated resolution of anaphoric reference, and that separate modules for reference resolution may be required for different report types, different institutions, and different types of anaphors. Accurate resolution will probably require extensive domain knowledge, especially for pathology and radiology reports with more part/whole and set/subset relations.
Conclusion: We hope researchers will leverage the annotations in this corpus to develop automated algorithms and will add to the annotations to generate a more extensive corpus.
© 2012 Elsevier Inc. All rights reserved.
Pages: 507-521
Number of pages: 15