All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning

Cited by: 137

Authors:

Airola, Antti [1]

Pyysalo, Sampo

Bjoerne, Jari

Pahikkala, Tapio

Ginter, Filip

Salakoski, Tapio

Affiliations:

[1] Univ Turku, Turku Ctr Comp Sci TUCS, FIN-20520 Turku, Finland

Source:

BMC BIOINFORMATICS | 2008 / Vol. 9 / Suppl. 11

DOI:

10.1186/1471-2105-9-S11-S2

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification codes:

071010; 081704;

Abstract:

Background: Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph-kernel-based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel can make use of full, general dependency graphs representing the sentence structure.

Results: We evaluate the proposed method on five publicly available PPI corpora, providing the most comprehensive evaluation done for a machine-learning-based PPI-extraction system. We additionally perform a detailed evaluation of the effects of training and testing on different resources, providing insight into the challenges involved in applying a system beyond the data it was trained on. Our method is shown to achieve state-of-the-art performance with respect to comparable evaluations, with 56.4 F-score and 84.8 AUC on the AImed corpus.

Conclusion: We show that the graph kernel approach performs at a state-of-the-art level in PPI extraction, and note the possible extension to the task of extracting complex interactions. Cross-corpus results provide further insight into how the learning generalizes beyond individual corpora. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect cross-validation strategies and problems related to comparing F-score results achieved on different evaluation resources. Recommendations for avoiding these pitfalls are provided.

References

Pages: 12

40 entries in total

[1] [Anonymous], ADV NEURAL INFORM PR

[2] [Anonymous], P 3 INT S SEM MIN BI

[3] [Anonymous], 2005, P 1 INT S SEM MIN BI

[4] [Anonymous], 2008, Proc. of the workshop on current trends in biomedical natural language processing

[5] [Anonymous], 2006, AGROBIZNES

[6] [Anonymous], 2008, P 46 ANN M ASS COMP

[7] BJORNE J, 2008, P 3 INT S SEM MIN BI, P125

[8] The use of the area under the ROC curve in the evaluation of machine learning algorithms [J]. Bradley, AP. PATTERN RECOGNITION, 1997, 30 (07): 1145-1159

[9] Comparative experiments on learning information extractors for proteins and their interactions [J]. Bunescu, R; Ge, RF; Kate, RJ; Marcotte, EM; Mooney, RJ; Ramani, AK; Wong, YW. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2005, 33 (02): 139-155

[10] Bunescu R, 2005, HLT EMNLP 2005, P724
