Using rule-based natural language processing to improve disease normalization in biomedical text

Cited by: 112
Authors
Kang, Ning [1]
Singh, Bharat [1]
Afzal, Zubair [1]
van Mulligen, Erik M. [1]
Kors, Jan A. [1]
Affiliation
[1] Erasmus Univ, Med Ctr, Dept Med Informat, NL-3000 CA Rotterdam, Netherlands
Keywords
Text mining; Biomedical concept identification; Natural language processing; Dictionary-based system; Rule-based system; GENE; EXTRACTION; PROTEIN; IDENTIFICATION; TERMS; TASK
DOI
10.1136/amiajnl-2012-001173
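Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
080201 [Mechanical manufacturing and automation]
Abstract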
Background and objective: In order for computers to extract useful information from unstructured text, a concept normalization system is needed to link relevant concepts in a text to sources that contain further information about the concept. Popular concept normalization tools in the biomedical field are dictionary-based. In this study we investigate the usefulness of natural language processing (NLP) as an adjunct to dictionary-based concept normalization.
Methods: We compared the performance of two biomedical concept normalization systems, MetaMap and Peregrine, on the Arizona Disease Corpus, with and without the use of a rule-based NLP module. Performance was assessed for exact and inexact boundary matching of the system annotations with those of the gold standard and for concept identifier matching.
Results: Without the NLP module, MetaMap and Peregrine attained F-scores of 61.0% and 63.9%, respectively, for exact boundary matching, and 55.1% and 56.9% for concept identifier matching. With the aid of the NLP module, the F-scores of MetaMap and Peregrine improved to 73.3% and 78.0% for boundary matching, and to 66.2% and 69.8% for concept identifier matching. For inexact boundary matching, performances further increased to 85.5% and 85.4%, and to 73.6% and 73.3% for concept identifier matching.
Conclusions: We have shown the added value of NLP for the recognition and normalization of diseases with MetaMap and Peregrine. The NLP module is general and can be applied in combination with any concept normalization system. Whether its use for concept types other than disease is equally advantageous remains to be investigated.
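The abstract reports F-scores under exact and inexact boundary matching, with and without concept identifier matching. The following is a minimal sketch of how such scores could be computed. It is not the authors' evaluation code: the `Annotation` type, the `f_score` function, and the choice to treat "inexact" as any character overlap between spans are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record; not taken from the paper."""
    start: int        # inclusive character offset
    end: int          # exclusive character offset
    concept_id: str   # e.g. a dictionary concept identifier


def spans_match(a: Annotation, b: Annotation, exact: bool) -> bool:
    """Exact: identical boundaries. Inexact (assumed): any overlap."""
    if exact:
        return a.start == b.start and a.end == b.end
    return a.start < b.end and b.start < a.end


def f_score(gold: List[Annotation], system: List[Annotation],
            exact: bool = True, match_ids: bool = False) -> float:
    """Harmonic mean of precision and recall over annotations."""
    def is_match(g: Annotation, s: Annotation) -> bool:
        if not spans_match(g, s, exact):
            return False
        return (not match_ids) or g.concept_id == s.concept_id

    # System annotations confirmed by at least one gold annotation.
    matched_system = sum(1 for s in system if any(is_match(g, s) for g in gold))
    # Gold annotations found by at least one system annotation.
    matched_gold = sum(1 for g in gold if any(is_match(g, s) for s in system))

    precision = matched_system / len(system) if system else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: the second system span overlaps its gold span but starts late,
# and the third system span has no gold counterpart.
gold = [Annotation(0, 10, "C1"), Annotation(20, 30, "C2")]
system = [Annotation(0, 10, "C1"), Annotation(22, 30, "C2"),
          Annotation(40, 45, "C3")]

exact_f = f_score(gold, system, exact=True)
inexact_f = f_score(gold, system, exact=False)
```

On this toy data the inexact score is higher than the exact one because the partially overlapping span counts as a hit, which mirrors the direction of the difference reported in the abstract.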
Pages: 876-881
Number of pages: 6