A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Cited by: 59
Authors
Verspoor, Karin [1]
Cohen, Kevin Bretonnel [1, 2]
Lanfranchi, Arrick [2, 3]
Warner, Colin [4]
Johnson, Helen L. [1]
Roeder, Christophe [1]
Choi, Jinho D. [3]
Funk, Christopher [1]
Malenkiy, Yuriy [1]
Eckert, Miriam [2]
Xue, Nianwen [4]
Baumgartner, William A., Jr. [1]
Bada, Michael [1]
Palmer, Martha [2]
Hunter, Lawrence E. [1]
Affiliations
[1] U Colorado Sch Med, Computat Biosci Program, Aurora, CO 80045 USA
[2] Univ Colorado Boulder, Dept Linguist, Boulder, CO 80309 USA
[3] Univ Colorado Boulder, Inst Cognit Sci, Boulder, CO 80309 USA
[4] Brandeis Univ, Dept Comp Sci, Waltham, MA 02454 USA
Source
BMC BIOINFORMATICS | 2012 / Volume 13
Keywords
INFORMATION EXTRACTION; GENE; PARSERS; NAMES
DOI
10.1186/1471-2105-13-207
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.
Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
Pages: 26