BioC: a minimalist approach to interoperability for biomedical text processing

Cited by: 92
Authors
Comeau, Donald C. [1]
Dogan, Rezarta Islamaj [1]
Ciccarese, Paolo [2, 3]
Cohen, Kevin Bretonnel [4]
Krallinger, Martin [5]
Leitner, Florian [5]
Lu, Zhiyong [1]
Peng, Yifan [6]
Rinaldi, Fabio [7]
Torii, Manabu [6]
Valencia, Alfonso [5]
Verspoor, Karin [8]
Wiegers, Thomas C. [9]
Wu, Cathy H. [6]
Wilbur, W. John [1]
Affiliations
[1] Natl Lib Med, Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
[2] Massachusetts Gen Hosp, Dept Neurol, Boston, MA 02114 USA
[3] Harvard Univ, Harvard Med Sch, Boston, MA 02115 USA
[4] Univ Colorado Denver, Sch Med, Ctr Computat Pharmacol, Aurora, CO 80045 USA
[5] Spanish Natl Canc Res Ctr, Struct & Computat Biol Grp, E-28029 Madrid, Spain
[6] Univ Delaware, Dept Comp & Informat Sci, Ctr Bioinformat & Computat Biol, Newark, DE 19711 USA
[7] Univ Zurich, Inst Computat Linguist, CH-8050 Zurich, Switzerland
[8] Univ Melbourne, Victoria Res Lab, Natl ICT Australia NICTA, Parkville, Vic 3010, Australia
[9] N Carolina State Univ, Dept Biol, Raleigh, NC 27695 USA
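Source
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION | 2013
Funding
National Science Foundation (USA); National Institutes of Health (USA); Swiss National Science Foundation;
Keywords
COMPARATIVE TOXICOGENOMICS DATABASE; RESOURCE; TOOL;
DOI
10.1093/database/bat064
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
090105 [Crop Production Systems and Ecological Engineering];
Abstract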
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions.
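As a rough illustration of the kind of format the abstract describes, the sketch below stores a passage of text together with named-entity annotations in XML, writes it out, and reads it back. It is not the authors' code: the element and attribute names (`collection`, `document`, `passage`, `annotation`, `infon`, `location`) are assumptions chosen for illustration and should be checked against the published BioC DTD, and the helper functions `build_collection` and `read_annotations` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, entities):
    """Build an XML tree holding one document, one passage, and its annotations.

    entities is a list of (start, end, type) character spans into text.
    Element names are illustrative assumptions, not a verified schema.
    """
    collection = ET.Element("collection")
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for i, (start, end, etype) in enumerate(entities):
        ann = ET.SubElement(passage, "annotation", id=f"T{i}")
        # The annotation type is stored as a key-value pair.
        ET.SubElement(ann, "infon", key="type").text = etype
        # Stand-off location: character offset and length within the passage.
        ET.SubElement(ann, "location", offset=str(start), length=str(end - start))
        ET.SubElement(ann, "text").text = text[start:end]
    return collection


def read_annotations(xml_string):
    """Parse the XML back into (type, offset, length, text) tuples."""
    root = ET.fromstring(xml_string)
    result = []
    for ann in root.iter("annotation"):
        loc = ann.find("location")
        result.append((
            ann.findtext("infon"),
            int(loc.get("offset")),
            int(loc.get("length")),
            ann.findtext("text"),
        ))
    return result


def span(text, phrase):
    """Return the (start, end) character span of the first match of phrase."""
    start = text.find(phrase)
    return start, start + len(phrase)


sample = "BRCA1 mutations are linked to breast cancer."
entities = [
    (*span(sample, "BRCA1"), "gene"),
    (*span(sample, "breast cancer"), "disease"),
]
xml_string = ET.tostring(build_collection("doc1", sample, entities), encoding="unicode")
round_trip = read_annotations(xml_string)
```

Keeping annotations as stand-off offsets into an unmodified passage text is one way to let sentences, tokens, and entities from different tools coexist over the same document; relationships between entities could then refer to annotation ids rather than to text spans.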
Pages: 15