Extraction of semantic biomedical relations from text using conditional random fields

Cited by: 133
Authors
Bundschus, Markus [1, 3]
Dejori, Mathaeus [2]
Stetter, Martin
Tresp, Volker
Kriegel, Hans-Peter [1, 3]
Affiliations
[1] Siemens AG, Corp Technol Informat & Commun, D-81739 Munich, Germany
[2] Siemens Corp Res, Integrated Data Syst Dept, Princeton, NJ 08540 USA
[3] Univ Munich, Inst Comp Sci, D-80538 Munich, Germany
DOI
10.1186/1471-2105-9-207
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The increasing amount of published literature in biomedicine represents an immense source of knowledge, which can only efficiently be accessed by a new generation of automated information extraction tools. Named entity recognition of well-defined objects, such as genes or proteins, has achieved a sufficient level of maturity such that it can form the basis for the next step: the extraction of relations that exist between the recognized entities. Whereas most early work focused on the mere detection of relations, the classification of the type of relation is also of great importance and this is the focus of this work. In this paper we describe an approach that extracts both the existence of a relation and its type. Our work is based on Conditional Random Fields, which have been applied with much success to the task of named entity recognition.
Results: We benchmark our approach on two different tasks. The first task is the identification of semantic relations between diseases and treatments. The available data set consists of manually annotated PubMed abstracts. The second task is the identification of relations between genes and diseases from a set of concise phrases, so-called GeneRIF (Gene Reference Into Function) phrases. In our experimental setting, we do not assume that the entities are given, as is often the case in previous relation extraction work. Rather the extraction of the entities is solved as a subproblem. Compared with other state-of-the-art approaches, we achieve very competitive results on both data sets. To demonstrate the scalability of our solution, we apply our approach to the complete human GeneRIF database. The resulting gene-disease network contains 34758 semantic associations between 4939 genes and 1745 diseases. The gene-disease network is publicly available as a machine-readable RDF graph.
Conclusion: We extend the framework of Conditional Random Fields towards the annotation of semantic relations from text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive with leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.
Pages: 14