Text-mining assisted regulatory annotation

Cited by: 19
Authors
Aerts, Stein [1 ,2 ]
Haeussler, Maximilian [3 ]
van Vooren, Steven [4 ]
Griffith, Obi L. [5 ]
Hulpiau, Paco [6 ]
Jones, Steven J. M. [5 ]
Montgomery, Stephen B.
Bergman, Casey M. [7 ]
Affiliations
[1] VIB, Dept Mol & Dev Genet, Neurogenet Lab, B-3000 Louvain, Belgium
[2] Katholieke Univ Leuven, Sch Med, Dept Human Genet, B-3000 Louvain, Belgium
[3] CNRS, Inst Neurosci A Fessard, F-91198 Gif Sur Yvette, France
[4] Katholieke Univ Leuven, Dept Elect Engn, B-3001 Heverlee, Belgium
[5] British Columbia Canc Agcy, Canadas Michael Smith Genome Sci Ctr, Vancouver, BC V5Z 4E6, Canada
[6] Univ Ghent VIB, Dept Mol Biomed Res, B-9052 Ghent, Belgium
[7] Univ Manchester, Fac Life Sci, Manchester M13 9PT, Lancs, England
Funding
Canadian Institutes of Health Research; Natural Sciences and Engineering Research Council of Canada; US National Science Foundation
DOI
10.1186/gb-2008-9-2-r31
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.
Results: We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process.
Conclusion: Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
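Illustrative sketch (not part of the original record): the two techniques named in the abstract, relevance ranking with a vector space model and extraction of DNA sequences from text, can be approximated in a few lines of Python. This is not the authors' implementation. The function names, the TF-IDF weighting, the use of a seed-set centroid as the query, and the 12-nucleotide minimum length are all assumptions chosen for illustration. Mapping extracted sequences to a genome is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_relevance(seed_docs, candidates):
    """Rank candidates by similarity to the centroid of known-relevant seeds.

    Returns a list of (score, candidate_index), highest score first.
    """
    vectors = tfidf_vectors(seed_docs + candidates)
    seed_vecs = vectors[:len(seed_docs)]
    cand_vecs = vectors[len(seed_docs):]
    centroid = Counter()
    for vec in seed_vecs:
        for term, weight in vec.items():
            centroid[term] += weight / len(seed_vecs)
    scored = [(cosine(centroid, vec), i) for i, vec in enumerate(cand_vecs)]
    return sorted(scored, reverse=True)


def extract_dna(text, min_len=12):
    """Return runs of A/C/G/T at least min_len long (assumed threshold)."""
    pattern = re.compile(r"\b[ACGT]{%d,}\b" % min_len, re.IGNORECASE)
    return [match.upper() for match in pattern.findall(text)]
```

In this sketch, `rank_by_relevance` takes abstracts already known to describe cis-regulatory experiments as seeds and orders unseen abstracts by cosine similarity to them, while `extract_dna` returns candidate sequences that a downstream alignment step could place on a genome.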
Pages: 13