Exploring feature sets for two-phase biomedical named entity recognition using semi-CRFs

Cited by: 28
Authors
Yang, Li [1]
Zhou, Yanhong [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Hubei, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Conditional random fields; Semi-conditional random fields; Feature sets; Two-phase
DOI
10.1007/s10115-013-0637-7
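Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification code
140502 [Artificial intelligence]
Abstract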
This paper presents a two-phase approach based on the semi-Markov conditional random fields model (semi-CRFs) and explores novel feature sets for classifying the entities in text into five types: protein, DNA, RNA, cell_line and cell_type. Semi-CRFs assign a label to a segment rather than to a single word, which is more natural than other machine learning methods such as the conditional random fields model (CRFs). Our approach divides the biomedical named entity recognition task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and classifies them all into a single generic type C. In the second phase, the semantic labeling sub-task assigns the correct entity type to each entity detected in the first phase. We explore novel feature sets at both phases to improve performance. For comparison, experiments are conducted with both CRFs and semi-CRFs at each phase. Our experiments on the JNLPBA 2004 datasets achieve an F-score of 74.64% based on semi-CRFs without deep domain knowledge or post-processing algorithms, which outperforms most state-of-the-art systems.
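The two-phase decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring functions `toy_boundary_score` and `toy_type_score`, the `max_len` limit, and the label "O" for non-entity tokens are all hypothetical stand-ins for a trained model, and the decoder ignores dependencies between adjacent segment labels, which a real semi-CRF would normally model. What it does show is the key idea: scores are attached to whole segments rather than single words, phase 1 marks entity spans with the generic type C, and phase 2 assigns one of the five entity types to each span.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)
ENTITY_TYPES = ["protein", "DNA", "RNA", "cell_line", "cell_type"]


def looks_entity_like(token: str) -> bool:
    # Hypothetical surface feature: contains an uppercase letter or a digit.
    return any(ch.isupper() or ch.isdigit() for ch in token)


def toy_boundary_score(tokens: List[str], start: int, end: int, label: str) -> float:
    # Hypothetical segment-level score standing in for learned feature weights.
    span = tokens[start:end]
    if label == "C":
        ok = all(looks_entity_like(t) for t in span)
        return 1.0 * len(span) if ok else -1.0 * len(span)
    # "O" segments: single non-entity-like tokens only.
    if len(span) == 1 and not looks_entity_like(span[0]):
        return 0.5
    return -1.0 * len(span)


def segment_viterbi(tokens: List[str],
                    score: Callable[[List[str], int, int, str], float],
                    labels: Tuple[str, ...] = ("C", "O"),
                    max_len: int = 4) -> List[Segment]:
    # Phase 1: best-scoring segmentation, scoring whole segments at once.
    # Simplification: no transition scores between adjacent segment labels.
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                s = best[start] + score(tokens, start, end, label)
                if s > best[end]:
                    best[end] = s
                    back[end] = (start, label)
    segments: List[Segment] = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((start, end, label))
        end = start
    return segments[::-1]


def toy_type_score(tokens: List[str], start: int, end: int, entity_type: str) -> float:
    # Hypothetical context feature: a following "gene" suggests DNA.
    next_tok = tokens[end].lower() if end < len(tokens) else ""
    if entity_type == "DNA":
        return 1.0 if next_tok == "gene" else 0.0
    if entity_type == "protein":
        return 0.5
    return 0.0


def recognize(tokens: List[str]) -> List[Segment]:
    # Phase 2: give each span labeled C in phase 1 its best-scoring entity type.
    result: List[Segment] = []
    for start, end, label in segment_viterbi(tokens, toy_boundary_score):
        if label == "C":
            best_type = max(ENTITY_TYPES,
                            key=lambda t: toy_type_score(tokens, start, end, t))
            result.append((start, end, best_type))
    return result


print(recognize(["the", "IL-2", "gene", "is", "expressed"]))
```

Separating the two phases means the boundary model only has to distinguish entity from non-entity spans, while the typing model can use features of the complete span and its context.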
Pages: 439-453
Page count: 15