All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning

Cited by: 137
Authors
Airola, Antti [1]
Pyysalo, Sampo
Bjoerne, Jari
Pahikkala, Tapio
Ginter, Filip
Salakoski, Tapio
Affiliations
[1] Univ Turku, Turku Ctr Comp Sci TUCS, FIN-20520 Turku, Finland
DOI
10.1186/1471-2105-9-S11-S2
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph-kernel-based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel can make use of full, general dependency graphs representing the sentence structure.
Results: We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine-learning-based PPI-extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with 56.4 F-score and 84.8 AUC on the AImed corpus.
Conclusion: We show that the graph kernel approach performs at a state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided.
Pages: 12
Related papers
40 items in total
[1]  
[Anonymous], ADV NEURAL INFORM PR
[2]  
[Anonymous], P 3 INT S SEM MIN BI
[3]  
[Anonymous], 2005, P 1 INT S SEM MIN BI
[4]  
[Anonymous], 2008, Proc. of the workshop on current trends in biomedical natural language processing
[5]  
[Anonymous], 2006, AGROBIZNES
[6]  
[Anonymous], 2008, P 46 ANN M ASS COMP
[7]  
BJORNE J, 2008, P 3 INT S SEM MIN BI, P125
[8]   The use of the area under the roc curve in the evaluation of machine learning algorithms [J].
Bradley, AP .
PATTERN RECOGNITION, 1997, 30 (07) :1145-1159
[9]   Comparative experiments on learning information extractors for proteins and their interactions [J].
Bunescu, R ;
Ge, RF ;
Kate, RJ ;
Marcotte, EM ;
Mooney, RJ ;
Ramani, AK ;
Wong, YW .
ARTIFICIAL INTELLIGENCE IN MEDICINE, 2005, 33 (02) :139-155
[10]  
Bunescu R, 2005, HLT EMNLP 2005, P724