Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD)

Cited by: 97
Authors
Wiegers, Thomas C. [1]
Davis, Allan Peter [1]
Cohen, K. Bretonnel [2, 3]
Hirschman, Lynette [3]
Mattingly, Carolyn J. [1]
Affiliations
[1] Mt Desert Isl Biol Lab, Dept Bioinformat, Salsbury Cove, ME USA
[2] Univ Colorado, Sch Med, Ctr Computat Pharmacol, Aurora, CO USA
[3] Mitre Corp, Ctr Informat Technol, Bedford, MA 01730 USA
Source
BMC BIOINFORMATICS | 2009 / Vol. 10
Keywords
TOOL; EXTRACTION; RETRIEVAL; PROTEIN;
DOI
10.1186/1471-2105-10-326
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Background: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage.
Results: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking).
Conclusion: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
Pages: 12
Related papers
31 in total
[1]   Text mining for biology - the way forward: opinions from leading scientists [J].
Altman, Russ B. ;
Bergman, Casey M. ;
Blake, Judith ;
Blaschke, Christian ;
Cohen, Aaron ;
Gannon, Frank ;
Grivell, Les ;
Hahn, Udo ;
Hersh, William ;
Hirschman, Lynette ;
Jensen, Lars Juhl ;
Krallinger, Martin ;
Mons, Barend ;
O'Donoghue, Sean I. ;
Peitsch, Manuel C. ;
Rebholz-Schuhmann, Dietrich ;
Shatkay, Hagit ;
Valencia, Alfonso .
GENOME BIOLOGY, 2008, 9
[2]  
Aronson AR, 2001, J AM MED INFORM ASSN, P17
[3]  
Camon EB, 2005, BMC BIOINFORMATICS, V6, DOI 10.1186/1471-2105-6-S1-S17
[4]   Content-rich biological network constructed by mining PubMed abstracts [J].
Chen, H ;
Sharp, BM .
BMC BIOINFORMATICS, 2004, 5 (1)
[5]  
Corbett P., 2008, P WORKSH CURR TRENDS
[6]  
CORBETT P, 2006, COMPUTATIONAL LIFE 2, V4216
[7]   The Comparative Toxicogenomics Database facilitates identification and understanding of chemical-gene-disease associations: arsenic as a case study [J].
Davis, Allan P. ;
Murphy, Cynthia G. ;
Rosenstein, Michael C. ;
Wiegers, Thomas C. ;
Mattingly, Carolyn J. .
BMC MEDICAL GENOMICS, 2008, 1 (1)
[8]   Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks [J].
Davis, Allan Peter ;
Murphy, Cynthia G. ;
Saraceni-Richards, Cynthia A. ;
Rosenstein, Michael C. ;
Wiegers, Thomas C. ;
Mattingly, Carolyn J. .
NUCLEIC ACIDS RESEARCH, 2009, 37 :D786-D792
[9]   Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text [J].
Garten, Yael ;
Altman, Russ B. .
BMC BIOINFORMATICS, 2009, 10 :S6
[10]  
Gospodnetic O., 2004, LUCENE ACTION