GENIA corpus - a semantically annotated corpus for bio-textmining

Cited by: 577
Authors
Kim, J-D [1]
Ohta, T. [2]
Tateisi, Y. [1]
Tsujii, J. [1,2]
Affiliations
[1] Japan Sci & Technol Corp, CREST, Bunkyo Ku, Tokyo 1130033, Japan
[2] Univ Tokyo, Dept Comp Sci, Bunkyo Ku, Tokyo 1130033, Japan
Keywords
Text Mining; Information Extraction; Corpus; Natural Language Processing; Computational Molecular Biology
DOI
10.1093/bioinformatics/btg1023
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining. Results: GENIA corpus version 3.0 consisting of 2000 MEDLINE abstracts has been released with more than 400 000 words and almost 100 000 annotations for biological terms.
Pages: i180-i182
Number of pages: 3