Biomedical named entity recognition using two-phase model based on SVMs

Citations: 156
Authors
Lee, KJ [1]
Hwang, YS [1]
Kim, S [1]
Rim, HC [1]
Affiliation
[1] Korea Univ, Dept Comp Sci & Engn, Nat Language Proc Lab, Seoul 136701, South Korea
Keywords
bioinformatics; named entity recognition; SVM; two-phase model; unbalanced class distribution; hierarchical multi-class SVM
DOI
10.1016/j.jbi.2004.08.012
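Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203 [Computer application technology]; 0835 [Software engineering]
Abstract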
Named entity (NE) recognition has become one of the most fundamental tasks in biomedical knowledge acquisition. In this paper, we present a two-phase named entity recognizer based on SVMs, which consists of a boundary identification phase and a semantic classification phase of named entities. When adapting SVMs to named entity recognition, the multi-class problem and the unbalanced class distribution problem become very serious in terms of training cost and performance. We try to solve these problems by separating the NE recognition task into two subtasks, where we use appropriate SVM classifiers and relevant features for each subtask. In addition, by employing a hierarchical classification method based on ontology, we effectively solve the multiclass problem concerning semantic classification. The experimental results on the GENIA corpus show that the proposed method is effective not only in reducing computational cost but also in improving performance. The F-score (beta = 1) for the boundary identification is 74.8 and the F-score for the semantic classification is 66.7. © 2004 Elsevier Inc. All rights reserved.
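The two-phase decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`spans_from_bio`, `recognize`, `toy_boundary_tagger`, `toy_semantic_classifier`), the B/I/O tag scheme, and the class labels are assumptions made for illustration, and both toy functions are simple rule-based stand-ins for the SVM classifiers the paper trains. The sketch only shows how boundary identification and semantic classification can be kept as separate steps, so that the second step sees whole candidate spans rather than single tokens.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # [start, end) token indices


def spans_from_bio(tags: List[str]) -> List[Span]:
    """Collect [start, end) spans from B/I/O boundary tags (assumed scheme)."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity begins; close any entity that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def recognize(
    tokens: List[str],
    tag_boundaries: Callable[[List[str]], List[str]],
    classify_span: Callable[[List[str], int, int], str],
) -> List[Tuple[int, int, str]]:
    """Phase 1 finds entity boundaries; phase 2 labels each found span."""
    tags = tag_boundaries(tokens)
    return [(s, e, classify_span(tokens, s, e)) for s, e in spans_from_bio(tags)]


def toy_boundary_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for the phase-1 classifier (not an SVM):
    treats any token containing an uppercase letter or digit as entity text."""
    tags, prev_inside = [], False
    for tok in tokens:
        inside = any(c.isupper() or c.isdigit() for c in tok)
        tags.append("O" if not inside else ("I" if prev_inside else "B"))
        prev_inside = inside
    return tags


def toy_semantic_classifier(tokens: List[str], start: int, end: int) -> str:
    """Hypothetical stand-in for the phase-2 classifier (not an SVM):
    a single keyword rule over the whole span, with made-up labels."""
    text = " ".join(tokens[start:end]).lower()
    return "protein" if "kappa" in text else "other"


tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B"]
entities = recognize(tokens, toy_boundary_tagger, toy_semantic_classifier)
```

Because `recognize` takes the two classifiers as arguments, each phase can be trained and swapped independently, which mirrors the abstract's point about using separate classifiers and features for each subtask.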
Pages: 436-447
Page count: 12