Integrating text mining into the MGI biocuration workflow

Cited by: 32
Authors
Dowell, K. G. [1, 2]
McAndrews-Hill, M. S. [1]
Hill, D. P. [1]
Drabkin, H. J. [1]
Blake, J. A. [1, 2]
Affiliations
[1] Jackson Lab, Bar Harbor, ME 04609 USA
[2] Univ Maine, Grad Sch Biomed Sci, Orono, ME 04469 USA
Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2009
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
NETWORK;
DOI
10.1093/database/bap019
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language processing (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in the formats preferred by scientific journals. In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen approximately 1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database. Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. In doing so, we prove the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature.
Pages: 11