Identifying non-elliptical entity mentions in a coordinated NP with ellipses

Cited by: 6
Authors
Chae, Jeongmin [1]
Jung, Younghee [1]
Lee, Taemin [1]
Jung, Soonyoung [1]
Huh, Chan [1]
Kim, Gilhan [1]
Kim, Hyeoncheol [1]
Oh, Heungbum [2, 3]
Affiliations
[1] Korea Univ, Dept Comp Sci Educ, Seoul, South Korea
[2] Asan Med Ctr, Dept Lab Med, Asan, South Korea
[3] Univ Ulsan, Coll Med, Ulsan, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Ellipsis resolution; Named entity recognition; Text mining; GENE;
DOI
10.1016/j.jbi.2013.10.002
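Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
080201 [Mechanical manufacturing and automation];
Abstract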
Named entities in the biomedical domain are often written using a Noun Phrase (NP) along with a coordinating conjunction such as 'and' and 'or'. In addition, repeated words among named entity mentions are frequently omitted. As a result, such named entities are often difficult to identify. Although various Named Entity Recognition (NER) methods have tried to solve this problem, these methods can only deal with relatively simple elliptical patterns in coordinated NPs. We propose a new NER method for identifying non-elliptical entity mentions with simple or complex ellipses using linguistic rules and an entity mention dictionary. The GENIA and CRAFT corpora were used to evaluate the performance of the proposed system. The GENIA corpus was used to evaluate the performance of the system according to the quality of the dictionary. The GENIA corpus comprises 3434 non-elliptical entity mentions in 1585 coordinated NPs with ellipses. The system achieves 92.11% precision, 95.20% recall, and 93.63% F-score in identification of non-elliptical entity mentions in coordinated NPs. The accuracy of the system in resolving simple and complex ellipses is 94.54% and 91.95%, respectively. The CRAFT corpus was used to evaluate the performance of the system under realistic conditions. The system achieved 78.47% precision, 67.10% recall, and 72.34% F-score in coordinated NPs. The performance evaluations of the system show that it efficiently solves the problem caused by ellipses, and improves NER performance. The algorithm is implemented in PHP and the code can be downloaded from https://code.google.com/p/medtextmining/. © 2013 Published by Elsevier Inc.
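To make the idea of recovering non-elliptical mentions concrete, the following is a minimal sketch in Python, not the authors' PHP implementation or their rule set. It assumes only the simplest pattern, in which earlier conjuncts omit words that appear at the end of the last conjunct (as in "B and T cells"), and it assumes a plain set of strings stands in for the entity mention dictionary. The function name expand_coordinated_np and the sample dictionary are hypothetical.

```python
import re


def expand_coordinated_np(np_text, dictionary):
    """Expand a coordinated NP into full entity mentions (illustrative only).

    Assumption: omitted words are a trailing part of the last conjunct,
    and `dictionary` is a set of known full entity mentions.
    """
    # Split the NP on commas and on the conjunctions 'and' / 'or'.
    parts = [p.strip() for p in re.split(r",|\band\b|\bor\b", np_text) if p.strip()]
    if len(parts) < 2:
        return parts

    last_tokens = parts[-1].split()
    mentions = []
    for part in parts[:-1]:
        # Try the longest candidate tail of the last conjunct first.
        for i in range(1, len(last_tokens)):
            candidate = part + " " + " ".join(last_tokens[i:])
            if candidate in dictionary:
                mentions.append(candidate)
                break
        else:
            # No dictionary match: keep the conjunct unchanged.
            mentions.append(part)
    mentions.append(parts[-1])
    return mentions


# Hypothetical dictionary entries, for illustration only.
known_mentions = {"B cells", "T cells"}
print(expand_coordinated_np("B and T cells", known_mentions))
# ['B cells', 'T cells']
```

The dictionary lookup is what prevents the sketch from inventing mentions: a conjunct is expanded only when the completed string is a known entity, otherwise it is left as written. Complex ellipses, which the abstract treats separately, would need more than this single tail-sharing rule.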
Pages: 139-152
Page count: 14