Gene name identification and normalization using a model organism database

Cited by: 47
Authors
Morgan, AA
Hirschman, L
Colosimo, M
Yeh, AS
Colombe, JB
Affiliations
[1] Mitre Corp, Bedford, MA 01730 USA
[2] Tufts Univ, Dept Biol, Medford, MA 02155 USA
Funding
National Science Foundation (USA);
Keywords
gene name finding; FlyBase; named entity extraction; text mining; natural language processing; bioNLP;
DOI
10.1016/j.jbi.2004.08.010
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision. (C) 2004 Elsevier Inc. All rights reserved.
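The balanced F-measure values quoted in the abstract can be checked from the reported precision and recall. The sketch below assumes the standard definition, the harmonic mean of precision and recall (F with beta = 1); the function name `f_measure` and the labels in `reported` are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F_beta score; beta = 1 gives the balanced harmonic mean."""
    # Guard against division by zero when both inputs are zero.
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported in the abstract; labels are ours.
reported = {
    "synonym matching, filtered": (0.50, 0.72),
    "HMM tagger plus filters": (0.88, 0.61),
}

for name, (p, r) in reported.items():
    print(f"{name}: F = {f_measure(p, r):.2f}")
```

Under this definition the two pairs round to 0.59 and 0.72, matching the figures given in the abstract.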
Pages: 396-410
Number of pages: 15