Recognition of protein/gene names from text using an ensemble of classifiers

Cited by: 50
Authors
Zhou, GD
Shen, D
Zhang, J
Su, J
Tan, SH
Affiliations
[1] Inst Infocomm Res, Singapore 119613, Singapore
[2] Natl Univ Singapore, Sch Comp, Singapore 119610, Singapore
Keywords
Support Vector Machine; Head Noun; Biomedical Domain; GENIA Corpus; Current Word;
DOI
10.1186/1471-2105-6-S1-S7
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules, including an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module, into the system to further improve the performance. Evaluation shows that our system achieves the best performance among 10 systems, with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).
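The simple majority voting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote`, the B/I/O tag set, the example predictions, and the tie-breaking rule (fall back to the first classifier when no tag has a strict majority) are all assumptions made here for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token tags from several classifiers by simple majority.

    predictions[i][t] is the tag that classifier i assigns to token t.
    Assumption: when no tag wins a strict majority, the tag from
    classifier `fallback` is used.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all classifiers must tag the same number of tokens")

    combined = []
    for votes in zip(*predictions):
        # Most frequent tag among the classifiers for this token.
        tag, count = Counter(votes).most_common(1)[0]
        # Accept it only on a strict majority; otherwise use the fallback classifier.
        combined.append(tag if count * 2 > len(votes) else votes[fallback])
    return combined


# Hypothetical outputs of one SVM and two HMM taggers on a four-token sentence.
svm_tags = ["B", "I", "O", "O"]
hmm1_tags = ["B", "O", "O", "B"]
hmm2_tags = ["B", "I", "O", "I"]

print(majority_vote([svm_tags, hmm1_tags, hmm2_tags]))
```

Voting token by token can produce tag sequences that are not well formed (for example an `I` with no preceding `B`), so a real system would likely need a consistency check afterwards; the abstract does not say how the paper handles this.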
Pages: 7
Related papers
16 in total
  • [1] Bagging predictors
    Breiman, L
    [J]. MACHINE LEARNING, 1996, 24 (02) : 123 - 140
  • [2] DAN S, 2003, P ACL 2003 WORKSH NA, P49
  • [3] Joachims J., 1999, ADV KERNEL METHODS S
  • [4] A systematic RNAi screen identifies a critical role for mitochondria in C-elegans longevity
    Lee, SS
    Lee, RYN
    Fraser, AG
    Kamath, RS
    Ahringer, J
    Ruvkun, G
    [J]. NATURE GENETICS, 2003, 33 (01) : 40 - 48
  • [5] Makino T., 2002, ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, P1, DOI DOI 10.3115/1118149.1118150
  • [6] Rijsbergen V., 1979, INFORM RETRIEVAL, VSecond Edi
  • [7] SALTON G, 1990, J AM SOC INFORM SCI, V41, P288, DOI 10.1002/(SICI)1097-4571(199006)41:4<288::AID-ASI8>3.0.CO;2-H
  • [8] Improved boosting algorithms using confidence-rated predictions
    Schapire, RE
    Singer, Y
    [J]. MACHINE LEARNING, 1999, 37 (03) : 297 - 336
  • [9] SCHWARTZ AS, 2003, P PC S BIOC PSB 2003