Automatically annotating documents with normalized gene lists

Cited by: 35
Authors
Crim, J [1 ]
McDonald, R [1 ]
Pereira, F [1 ]
Affiliations
[1] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA
Funding
US National Science Foundation;
Keywords
Pattern Match; Gene Normalization; Candidate List; Development Data; Gene Tagger;
DOI
10.1186/1471-2105-6-S1-S13
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Background: Document gene normalization is the problem of creating a list of unique identifiers for the genes that are mentioned within a document. Automating this process has many potential applications in both information extraction and database curation systems. Here we present two separate solutions to this problem. The first is primarily based on standard pattern matching and information extraction techniques. The second and more novel solution uses a statistical classifier to recognize valid gene matches from a list of known gene synonyms. Results: We compare the results of the two systems, analyze their merits, and argue that the classification-based system is preferable for many reasons, including performance, simplicity, and robustness. Our best systems attain a balanced precision and recall in the range of 74%-92%, depending on the organism.
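The task described in the abstract can be illustrated with a minimal sketch: look up known synonyms in the text to get candidate matches, filter the candidates, and return the unique identifiers. This is not the authors' system. The lexicon, the identifiers (`GENE:0001`, `GENE:0002`), and the function names are all hypothetical, and the `is_valid` step is a simple length heuristic standing in for the statistical classifier the abstract mentions.

```python
import re

# Hypothetical synonym lexicon: gene identifier -> known synonyms.
LEXICON = {
    "GENE:0001": ["wingless", "wg"],
    "GENE:0002": ["hedgehog", "hh"],
}


def candidate_matches(text, lexicon):
    """Yield (gene_id, synonym) for every synonym found as a whole word."""
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            # Lookarounds keep "hh" from matching inside a longer word.
            pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                yield gene_id, synonym


def is_valid(gene_id, synonym, text):
    """Stand-in for a trained classifier (assumption): reject short, ambiguous synonyms."""
    return len(synonym) > 2


def normalize(text, lexicon):
    """Return the sorted list of unique gene identifiers accepted for the document."""
    accepted = {
        gene_id
        for gene_id, synonym in candidate_matches(text, lexicon)
        if is_valid(gene_id, synonym, text)
    }
    return sorted(accepted)


if __name__ == "__main__":
    print(normalize("The wingless gene interacts with hh signalling.", LEXICON))
```

In this sketch the short synonym `hh` is found as a candidate but rejected by the filter, so only the identifier for `wingless` is returned. A real filter would be a classifier trained on labelled matches rather than a fixed rule.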
Pages: 7