High-performance gene name normalization with GENO

Cited by: 68
Authors
Wermter, Joachim [1]
Tomanek, Katrin [1]
Hahn, Udo [1]
Affiliations
[1] Univ Jena, Jena Univ Language & Informat Engn JULIE Lab, D-07743 Jena, Germany
Keywords
PROTEIN; EXTRACTION;
DOI
10.1093/bioinformatics/btp071
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: The recognition and normalization of textual mentions of gene and protein names is both particularly important and challenging. Its importance lies in the fact that they constitute the crucial conceptual entities in biomedicine. Their recognition and normalization remains a challenging task because of widespread gene name ambiguities within species, across species, with common English words and with medical sublanguage terms.
Results: We present GENO, a highly competitive system for gene name normalization, which obtains an F-measure performance of 86.4% (precision: 87.8%, recall: 85.0%) on the BIOCREATIVE-II test set, thus being on a par with the best system on that task. Our system tackles the complex gene normalization problem by employing a carefully crafted suite of symbolic and statistical methods, and by fully relying on publicly available software and data resources, including extensive background knowledge based on semantic profiling. A major goal of our work is to present GENO's architecture in a lucid and perspicuous way to pave the way to full reproducibility of our results.
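The F-measure quoted in the abstract can be checked against the stated precision and recall. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall), which the record itself does not spell out; the function name `f_measure` and its `beta` parameter are illustrative and not taken from GENO.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, a weighted harmonic mean of precision and recall.

    beta = 1.0 gives the balanced F1 score (assumed here; the abstract
    only says "F-measure").
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both inputs are zero: define the score as 0 instead of dividing by zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Precision and recall as reported in the abstract for the BIOCREATIVE-II test set.
score = f_measure(precision=0.878, recall=0.850)
print(f"{score:.1%}")  # prints 86.4%
```

Under the F1 assumption, the reported precision and recall reproduce the 86.4% figure given in the abstract.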
Pages: 815-821
Number of pages: 7