pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature

Cited by: 4
Authors
Ding, Ruoyao [1]
Arighi, Cecilia N. [1, 2]
Lee, Jung-Youn [3]
Wu, Cathy H. [1, 2]
Vijay-Shanker, K. [1]
Affiliations
[1] Univ Delaware, Dept Comp & Informat Sci, Newark, DE 19716 USA
[2] Univ Delaware, Ctr Bioinformat & Computat Biol, Newark, DE USA
[3] Univ Delaware, Dept Plant & Soil Sci, Newark, DE 19717 USA
Source
PLOS ONE | 2015 / Vol. 10 / Issue 08
Funding
US National Science Foundation;
Keywords
NOMENCLATURE; EXTRACTION; TASK;
DOI
10.1371/journal.pone.0135305
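Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes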
07 ; 0710 ; 09 ;
Abstract
Background: Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines.
Methods: In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these steps.
Results: We evaluated the performance of pGenN on an in-house, expertly annotated corpus consisting of 104 plant-related abstracts. Our system achieved an F-value of 88.9% (precision 90.9%, recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).
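The three steps named in the abstract, and the way precision, recall and F-value relate, can be illustrated with a minimal Python sketch. This is not the pGenN implementation: the dictionary, the gene names (TOYG1, TOYG2), the identifiers and the species heuristic are all hypothetical placeholders chosen for illustration, and the F-value is assumed to be the usual harmonic mean of precision and recall.

```python
import re

# Hypothetical toy dictionary: (species, gene name) -> identifier.
# Gene names and identifiers are made-up placeholders, not real accessions.
TOY_DICTIONARY = {
    ("Arabidopsis thaliana", "TOYG1"): "TOY:AT-1",
    ("Oryza sativa", "TOYG1"): "TOY:OS-1",
    ("Oryza sativa", "TOYG2"): "TOY:OS-2",
}
SPECIES = sorted({species for species, _ in TOY_DICTIONARY})
GENE_NAMES = sorted({name for _, name in TOY_DICTIONARY})


def detect_mentions(text):
    """Step 1 (toy): dictionary-based mention detection by whole-word match."""
    return [name for name in GENE_NAMES
            if re.search(r"\b" + re.escape(name) + r"\b", text)]


def assign_species(text):
    """Step 2 (toy): pick the species whose name appears earliest in the text."""
    found = [(text.find(species), species) for species in SPECIES if species in text]
    return min(found)[1] if found else None


def normalize(text):
    """Step 3 (toy): map each mention to an identifier within the assigned species."""
    species = assign_species(text)
    identifiers = set()
    for name in detect_mentions(text):
        identifier = TOY_DICTIONARY.get((species, name))
        if identifier is not None:
            identifiers.add(identifier)
    return identifiers


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F-value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(predicted, gold):
    """Score a set of predicted identifiers against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    print(normalize("In Oryza sativa, TOYG1 and TOYG2 are induced by drought."))
    # F-value recomputed from the rounded precision and recall in the abstract.
    print(round(f_measure(0.909, 0.872), 3))
```

Recomputing the F-value from the rounded figures in the abstract gives roughly 0.89, which agrees with the reported 88.9% up to rounding of the inputs.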
Pages: 23