A multi-strategy approach to biological named entity recognition

Cited by: 27
Authors
Atkinson, John [1]
Bull, Veronica [1]
Affiliations
[1] Univ Concepcion, Dept Comp Sci, Concepcion, Chile
Keywords
Named entity recognition; Natural language processing; Markov models; Bioinformatics; Machine learning; PROTEIN NAMES; SYSTEM; GENE;
DOI
10.1016/j.eswa.2012.05.033
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recognizing and disambiguating the names of bio-entities (genes, proteins, cells, etc.) are very challenging tasks: some biological databases can be outdated, names may not be normalized, abbreviations are used, syntax and word order are modified, etc. Thus, the same bio-entity might be written in different ways, which makes searching a key obstacle, as much of the relevant literature containing those entities might not be found. As a consequence, the same protein may have to be searched for under different names, or an existing protein name may be reused for a new protein with completely different features; hence named-entity recognition methods are required. In this paper, a bio-entity recognition model is presented which combines different classification methods and incorporates simple pre-processing tasks for recognizing bio-entities (genes and proteins). Linguistic pre-processing and feature representation for training and testing are observed to positively affect the overall performance of the method, showing promising results. Unlike some state-of-the-art methods, the approach does not require additional knowledge bases or specific-purpose post-processing tasks, which makes it more appealing. Experiments showing the promise of the model compared to other state-of-the-art methods are discussed. (c) 2012 Elsevier Ltd. All rights reserved.
Pages: 12968-12974
Number of pages: 7