eGenPub, a text mining system for extending computationally mapped bibliography for UniProt Knowledgebase by capturing centrality

Cited by: 3
Authors
Ding, Ruoyao [1 ]
Boutet, Emmanuel [2 ]
Lieberherr, Damien [2 ]
Schneider, Michel [2 ]
Tognolli, Michael [2 ]
Wu, Cathy H. [1 ,3 ,4 ,5 ]
Vijay-Shanker, K. [1 ]
Arighi, Cecilia N. [1 ,3 ,4 ,5 ]
Affiliations
[1] Univ Delaware, Dept Comp & Informat Sci, Newark, DE 19716 USA
[2] Ctr Med Univ Geneva, Swiss Inst Bioinformat, Geneva, Switzerland
[3] Univ Delaware, Ctr Bioinformat & Computat Biol, Newark, DE 19716 USA
[4] Univ Delaware, Prot Informat Resource, Newark, DE 19716 USA
[5] Georgetown Univ, Washington, DC 20007 USA
Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2017
Funding
US National Institutes of Health;
Keywords
DATABASE; PROTEIN;
DOI
10.1093/database/bax081
Chinese Library Classification number
Q [Biological sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
UniProt Knowledgebase (UniProtKB) is a publicly available database that provides access to a vast amount of protein sequence and functional information. To widen the scope of the publications associated with a protein entry, UniProt has introduced the computationally mapped additional bibliography section, which includes literature collected from external sources. In this article, we describe a text mining system, eGenPub, which selects articles that are 'about' specific proteins and allows automatic identification of additional bibliography for given UniProt protein entries. Focusing on plant proteins initially, eGenPub utilizes a gene normalization tool called pGenN, and a trained support vector machine model, which achieves a precision of 95.3%, to predict whether an article, based on its abstract, should be linked to a given UniProt entry. We have carried out full-scale processing of PubMed using eGenPub for eight common plant species. Altogether, 9025 articles are identified as relevant bibliography for 4752 UniProt entries, among which 5252 are additional papers not in the existing publication section. This newly computationally mapped additional bibliography from eGenPub is being integrated into the UniProt production pipeline and can be accessed via the UniProtKB protein entry publication view.
Pages: 9
Related papers
24 in total
  • [1] Agarwala R, 2018, NUCLEIC ACIDS RES, V46, pD8, DOI 10.1093/nar/gkx1095
  • [2] Arighi C., 2017, BIOINFORMATICS, DOI 10.1093/bioinformatics/btx439
  • [3] UniProt: the universal protein knowledgebase
    Bateman, Alex
    Martin, Maria Jesus
    O'Donovan, Claire
    Magrane, Michele
    Alpi, Emanuele
    Antunes, Ricardo
    Bely, Benoit
    Bingley, Mark
    Bonilla, Carlos
    Britto, Ramona
    Bursteinas, Borisas
    Bye-A-Jee, Hema
    Cowley, Andrew
    Da Silva, Alan
    De Giorgi, Maurizio
    Dogan, Tunca
    Fazzini, Francesco
    Castro, Leyla Garcia
    Figueira, Luis
    Garmiri, Penelope
    Georghiou, George
    Gonzalez, Daniel
    Hatton-Ellis, Emma
    Li, Weizhong
    Liu, Wudong
    Lopez, Rodrigo
    Luo, Jie
    Lussi, Yvonne
    MacDougall, Alistair
    Nightingale, Andrew
    Palka, Barbara
    Pichler, Klemens
    Poggioli, Diego
    Pundir, Sangya
    Pureza, Luis
    Qi, Guoying
    Rosanoff, Steven
    Saidi, Rabie
    Sawford, Tony
    Shypitsyna, Aleksandra
    Speretta, Elena
    Turner, Edward
    Tyagi, Nidhi
    Volynkin, Vladimir
    Wardell, Tony
    Warner, Kate
    Watkins, Xavier
    Zaru, Rossana
    Zellner, Hermann
    Xenarios, Ioannis
    [J]. NUCLEIC ACIDS RESEARCH, 2017, 45 (D1) : D158 - D169
  • [4] Boser B. E., 1992, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, P144, DOI 10.1145/130385.130401
  • [5] Boutet E, 2016, METHODS MOL BIOL, V1374, P23, DOI 10.1007/978-1-4939-3167-5_2
  • [6] The Pea TCP Transcription Factor PsBRC1 Acts Downstream of Strigolactones to Control Shoot Branching
    Braun, Nils
    de Saint Germain, Alexandre
    Pillot, Jean-Paul
    Boutet-Mercey, Stephanie
    Dalmais, Marion
    Antoniadi, Ioanna
    Li, Xin
    Maia-Grondard, Alessandra
    Le Signor, Christine
    Bouteiller, Nathalie
    Luo, Da
    Bendahmane, Abdelhafid
    Turnbull, Colin
    Rameau, Catherine
    [J]. PLANT PHYSIOLOGY, 2012, 158 (01) : 225 - 238
  • [7] 2 GENES ENCODING GF14-(14-3-3)-PROTEINS IN ZEA-MAYS - STRUCTURE, EXPRESSION, AND POTENTIAL REGULATION BY THE G-BOX-BINDING COMPLEX
    DEVETTEN, NC
    FERL, RJ
    [J]. PLANT PHYSIOLOGY, 1994, 106 (04) : 1593 - 1604
  • [8] pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature
    Ding, Ruoyao
    Arighi, Cecilia N.
    Lee, Jung-Youn
    Wu, Cathy H.
    Vijay-Shanker, K.
    [J]. PLOS ONE, 2015, 10 (08):
  • [9] Purification and identification of linoleic acid hydroperoxides generated by soybean seed lipoxygenases 2 and 3
    Fukushige, H
    Wang, CX
    Simpson, TD
    Gardner, HW
    Hildebrand, DF
    [J]. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY, 2005, 53 (14) : 5691 - 5694
  • [10] WormBase 2016: expanding to enable helminth genomic research
    Howe, Kevin L.
    Bolt, Bruce J.
    Cain, Scott
    Chan, Juancarlos
    Chen, Wen J.
    Davis, Paul
    Done, James
    Down, Thomas
    Gao, Sibyl
    Grove, Christian
    Harris, Todd W.
    Kishore, Ranjana
    Lee, Raymond
    Lomax, Jane
    Li, Yuling
    Muller, Hans-Michael
    Nakamura, Cecilia
    Nuin, Paulo
    Paulini, Michael
    Raciti, Daniela
    Schindelman, Gary
    Stanley, Eleanor
    Tuli, Mary Ann
    Van Auken, Kimberly
    Wang, Daniel
    Wang, Xiaodong
    Williams, Gary
    Wright, Adam
    Yook, Karen
    Berriman, Matthew
    Kersey, Paul
    Schedl, Tim
    Stein, Lincoln
    Sternberg, Paul W.
    [J]. NUCLEIC ACIDS RESEARCH, 2016, 44 (D1) : D774 - D780