BioRAT: extracting biological information from full-length papers

Cited by: 89
Authors
Corney, DPA [1]
Buxton, BF [1]
Langdon, WB [1]
Jones, DT [1]
Affiliations
[1] UCL, Dept Comp Sci, Bioinformat Unit, London WC1E 6BT, England
Funding
Biotechnology and Biological Sciences Research Council (BBSRC), UK
DOI
10.1093/bioinformatics/bth386
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Converting the vast quantity of free-format text found in journals into a concise, structured format makes the researcher's quest for information easier. Recently, several information extraction systems have been developed that attempt to simplify the retrieval and analysis of biological and medical data. Most of this work has used the abstract alone, owing to the convenience of access and the quality of data. Abstracts are generally available through central collections with easy direct access (e.g. PubMed). Full-text papers contain more information but are distributed across many locations (e.g. publishers' web sites, journal web sites and local repositories), making access more difficult. In this paper, we present BioRAT (Biological Research Assistant for Text mining), a new information extraction (IE) tool designed specifically for biomedical IE, which can locate and analyse both abstracts and full-length papers. BioRAT combines a document search ability with domain-specific IE.

Results: We show, first, that BioRAT performs as well as existing systems when applied to abstracts and, second, that significantly more information is available to BioRAT through the full-length papers than via the abstracts alone. Typically, less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper. Overall, BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.6% recall with 51.25% precision on full-length papers.
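
The recall and precision figures quoted in the abstract follow the usual information-extraction definitions: recall is the share of target facts that were extracted, and precision is the share of extracted facts that are correct. The sketch below shows that arithmetic only. The function name and the counts are hypothetical and are not taken from the paper, which does not state its raw counts in the abstract.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw extraction counts.

    true_positives:  extracted facts that match a target fact
    false_positives: extracted facts that match no target fact
    false_negatives: target facts that were not extracted
    """
    extracted = true_positives + false_positives
    targets = true_positives + false_negatives
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / targets if targets else 0.0
    return precision, recall


# Hypothetical counts for illustration only (NOT BioRAT's results).
p, r = precision_recall(true_positives=40, false_positives=40, false_negatives=60)
print(f"precision={p:.2%} recall={r:.2%}")
```

With these made-up counts, 40 of the 80 extracted facts are correct and 40 of the 100 target facts are found, so precision is higher than recall, the same pattern the abstract reports.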
Pages: 3206-3213
Number of pages: 8