PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine

Cited by: 176
Authors
Donaldson, I
Martin, J
de Bruijn, B
Wolting, C
Lay, V
Tuekam, B
Zhang, SD
Baskin, B
Bader, GD
Michalickova, K
Pawson, T
Hogue, CWV
Affiliations
[1] Mt Sinai Hosp, Samuel Lunenfeld Res Inst, Toronto, ON M5G 1X5, Canada
[2] Natl Res Council Canada, Inst Informat Technol, Ottawa, ON K1A 0R6, Canada
[3] MDS Proteom Inc, Toronto, ON M9W 7H4, Canada
[4] Univ Toronto, Dept Biochem, Toronto, ON, Canada
[5] Univ Toronto, Dept Mol & Med Genet, Toronto, ON, Canada
DOI
10.1186/1471-2105-4-11
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles, where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task of backfilling the database could be reduced by using Support Vector Machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND.
Results: Cross-validation estimated the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information at 92%, 90% and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high-throughput interactions present in another yeast protein interaction database. Finally, this system was applied to a real-world curation problem, and its use was found to reduce the task duration by 70%, thus saving 176 days.
Conclusions: Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.
Pages: 13
Related papers
32 in total
[1]  
ANDREW KM, 1996, BOW TOOLKIT STAT LAN
[2]   BIND - a data specification for storing and describing biomolecular interactions, molecular complexes and pathways [J].
Bader, GD ;
Hogue, CWV .
BIOINFORMATICS, 2000, 16 (05) :465-477
[3]   Analyzing yeast protein-protein interaction data obtained from different sources [J].
Bader, GD ;
Hogue, CWV .
NATURE BIOTECHNOLOGY, 2002, 20 (10) :991-997
[4]   BIND - The Biomolecular Interaction Network Database [J].
Bader, GD ;
Donaldson, I ;
Wolting, C ;
Ouellette, BFF ;
Pawson, T ;
Hogue, CWV .
NUCLEIC ACIDS RESEARCH, 2001, 29 (01) :242-245
[5]  
Blaschke C, 2001, Genome Inform, V12, P123
[6]   The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 [J].
Boeckmann, B ;
Bairoch, A ;
Apweiler, R ;
Blatter, MC ;
Estreicher, A ;
Gasteiger, E ;
Martin, MJ ;
Michoud, K ;
O'Donovan, C ;
Phan, I ;
Pilbout, S ;
Schneider, M .
NUCLEIC ACIDS RESEARCH, 2003, 31 (01) :365-370
[7]  
CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411
[8]  
DEBRUIJN B, 2001, P ASIST ANNU MEET, P450
[9]  
Dumais S., 1998, Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, P148, DOI 10.1145/288627.288651
[10]   Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO) [J].
Dwight, SS ;
Harris, MA ;
Dolinski, K ;
Ball, CA ;
Binkley, G ;
Christie, KR ;
Fisk, DG ;
Issel-Tarver, L ;
Schroeder, M ;
Sherlock, G ;
Sethuraman, A ;
Weng, S ;
Botstein, D ;
Cherry, JM .
NUCLEIC ACIDS RESEARCH, 2002, 30 (01) :69-72