Annotating genes and genomes with DNA sequences extracted from biomedical articles

Cited by: 19
Authors
Haeussler, Maximilian [1]
Gerner, Martin [1]
Bergman, Casey M. [1]
Affiliations
[1] Univ Manchester, Fac Life Sci, Manchester M13 9PT, Lancs, England
Funding
Biotechnology and Biological Sciences Research Council (UK);
Keywords
TUMOR-NECROSIS-FACTOR; ENSEMBL;
DOI
10.1093/bioinformatics/btr043
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study. Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ~20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments. Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.
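As a rough illustration of the first step named in the Results (finding DNA sequences in article text), the Python sketch below scans a string for long, uninterrupted runs of nucleotide letters. It is not the text2genome implementation: the function name `extract_dna_candidates`, the 18-letter minimum length, and the sample sentence are assumptions made for this example, and mapping the candidates to a genome is not shown.

```python
import re

# Assumed threshold for this sketch: shorter runs are too likely to be
# ordinary words ("a", "tag", "cat") rather than real sequences.
MIN_LENGTH = 18

# A whole "word" made only of nucleotide letters (DNA or RNA, either case).
NUCLEOTIDE_RUN = re.compile(r"\b[ACGTUacgtu]+\b")


def extract_dna_candidates(text, min_length=MIN_LENGTH):
    """Return upper-case DNA candidates found in text, in order of appearance."""
    candidates = []
    for match in NUCLEOTIDE_RUN.finditer(text):
        # Normalize case and convert RNA notation (U) to DNA notation (T).
        seq = match.group().upper().replace("U", "T")
        if len(seq) >= min_length:
            candidates.append(seq)
    return candidates


if __name__ == "__main__":
    sample = (
        "The forward primer 5'-GATTACAGATTACAGATTACA-3' "
        "was used to amplify a tag region."
    )
    print(extract_dna_candidates(sample))
```

The length filter is the only safeguard here against false positives, so a real pipeline would need further checks (for example, sequences split across lines or table cells) before attempting any genome mapping.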
Pages: 980-986
Page count: 7