The BioLexicon: a large-scale terminological resource for biomedical text mining

Cited by: 37
Authors
Thompson, Paul [1, 2]
McNaught, John [1, 2]
Montemagni, Simonetta [3]
Calzolari, Nicoletta [3]
del Gratta, Riccardo [3]
Lee, Vivian [4]
Marchi, Simone [3]
Monachini, Monica [3]
Pezik, Piotr [4]
Quochi, Valeria [3]
Rupp, C. J. [1, 2]
Sasaki, Yutaka [1, 2, 5]
Venturi, Giulia [3]
Rebholz-Schuhmann, Dietrich [4]
Ananiadou, Sophia [1, 2]
Affiliations
[1] Univ Manchester, Sch Comp Sci, Manchester M13 9PL, Lancs, England
[2] Univ Manchester, Natl Ctr Text Min, Manchester Interdisciplinary Bioctr, Manchester M1 7DN, Lancs, England
[3] CNR, Ist Linguist Computaz, I-56124 Pisa, Italy
[4] European Bioinformat Inst, Cambridge CB10 1SD, England
[5] Toyota Technol Inst, Nagoya, Aichi 468, Japan
Source
BMC BIOINFORMATICS | 2011 / Volume 12
Funding
Biotechnology and Biological Sciences Research Council (UK); Medical Research Council (UK); Wellcome Trust (UK);
Keywords
PART-OF-SPEECH TAGGER; ARGUMENT STRUCTURES; EVENT EXTRACTION; ANNOTATED CORPUS; ONTOLOGY; PROTEIN; TOOL; UNIFICATION; BIOLOGY; LEXICON;
DOI
10.1186/1471-2105-12-397
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.
Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.
Conclusions: The BioLexicon contains over 2.2 M lexical entries and over 1.8 M terminological variants, as well as over 3.3 M semantic relations, including over 2 M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing integration of the resource into a number of different tools, and evaluating improvements in performance that this can bring.
Pages: 29