Terminological resources for text mining over biomedical scientific literature

Cited by: 11

Authors:

Rinaldi, Fabio [1]

Kaljurand, Kaarel [1]

Saetre, Rune [2,3]

Affiliations:

[1] Univ Zurich, Inst Computat Linguist, CH-8050 Zurich, Switzerland

[2] Norwegian Univ Sci & Technol NTNU, Dept Comp & Informat Sci IDI, NO-7491 Trondheim, Norway

[3] Univ Tokyo, Tsujii Lab, Dept Comp Sci, Bunkyo Ku, Tokyo 1130033, Japan

Source:

ARTIFICIAL INTELLIGENCE IN MEDICINE | 2011 / Volume 52 / Issue 02

Funding:

Swiss National Science Foundation;

Keywords:

Information extraction; Text mining; Terminological resources; SEARCH ENGINE; PROTEIN; GENE; DICTIONARY; ONTOGENE; SERVICES; DRUGS;

DOI:

10.1016/j.artmed.2011.04.011

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification code:

140502 [Artificial intelligence];

Abstract:

Objective: We present a combined terminological resource for text mining over biomedical literature. The purpose of the resource is to allow the detection of mentions of specific biological entities in scientific publications, and their grounding to widely accepted identifiers. This is an essential process, useful in itself, and necessary as an intermediate step for almost every type of complex text mining application.

Methods: We discuss some of the properties of the terminology for this domain, in particular the degree of ambiguity, which constitutes a peculiar problem for text mining applications. Without a correct recognition and disambiguation of the domain entities no reliable results can be produced.

Results: We also discuss an application that makes use of the resulting terminological knowledge base. We annotate an existing corpus of sentences about protein interactions. The annotation consists of a normalization step that matches the terms in our resource with their actual representation in the corpus, and a disambiguation step that resolves the ambiguity of matched terms.

Conclusion: In this paper we present a large terminological resource, compiled through the aggregation of a number of different manually curated sources. We discuss the lexical properties of such resources, specifically the degree of ambiguity of the terms, and we inspect the causes of such ambiguity, in particular for protein names. This information is of vital importance for the implementation of an efficient term normalization and grounding algorithm.

(C) 2011 Elsevier B.V. All rights reserved.

Pages: 107-114

Number of pages: 8
