Two learning approaches for protein name extraction

Cited by: 8
Authors
Tatar, Serhan [1]
Cicekli, Ilyas [1]
Affiliations
[1] Bilkent Univ, Dept Comp Engn, TR-06800 Ankara, Turkey
Keywords
Statistical learning; Bigram language model; Rule learning; Protein name extraction; Information extraction; GENE; IDENTIFICATION; PERFORMANCE; BLAST
DOI
10.1016/j.jbi.2009.05.004
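Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
080201 [Mechanical manufacturing and automation]
Abstract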
Protein name extraction, one of the basic tasks in the automatic extraction of information from biological texts, remains challenging. In this paper, we explore two machine learning techniques and present the results of our experiments. The first method uses a bigram language model to extract protein names. The second uses an automatic rule learning method that identifies protein names in biological texts. In both cases, we generalize protein names by using hierarchically categorized syntactic token types. We conducted our experiments on two datasets. The method based on the bigram language model achieved an F-score of 67.7% on the YAPEX dataset and 66.8% on the GENIA corpus. The rule learning method obtained an F-score of 61.8% on the YAPEX dataset and 61.0% on the GENIA corpus. The results of the comparative experiments demonstrate that both techniques are applicable to automatic protein name extraction, a prerequisite for the large-scale processing of biomedical literature. (C) 2009 Elsevier Inc. All rights reserved.
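The F-scores quoted in the abstract combine precision and recall into a single figure. The sketch below shows how such a score is commonly computed from extraction counts. It assumes the balanced F1 variant (beta = 1), which the abstract does not confirm. The function name `f_score` and the counts are hypothetical and are not taken from the paper.

```python
def f_score(true_positives: int, false_positives: int,
            false_negatives: int, beta: float = 1.0) -> float:
    """Return the F-beta score for an extraction run.

    Assumption: beta = 1 (balanced F1) unless stated otherwise.
    Returns 0.0 when precision or recall is undefined or zero.
    """
    predicted = true_positives + false_positives  # names the system proposed
    actual = true_positives + false_negatives     # names in the gold standard
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only, not results from the paper.
score = f_score(true_positives=60, false_positives=30, false_negatives=30)
print(f"F-score: {score:.3f}")
```

With these illustrative counts, precision and recall are both 60/90, so the score is about 0.667. How a predicted name is matched against a gold-standard name (exact or partial boundaries) changes the counts and is a separate design choice not covered here.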
Pages: 1046-1055
Page count: 10