Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE

Cited by: 16
Authors
Neveol, Aurelie
Wilbur, W. John
Lu, Zhiyong
Affiliation
[1] National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda
Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2012
Funding
U.S. National Institutes of Health;
Keywords
BIOMEDICAL TEXT; JOURNAL ARTICLES; BIOCREATIVE III; EXPRESSION; DATABASES; CURATION;
DOI
10.1093/database/bas026
Chinese Library Classification
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low, <50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation. Database URLs: http://www.ncbi.nlm.nih.gov/PubMed, http://www.ncbi.nlm.nih.gov/geo/, http://www.rcsb.org/pdb/
Pages: 9