Extracting human protein interactions from MEDLINE using a full-sentence parser

Cited by: 154
Authors
Daraselia, N [1]
Yuryev, A [1]
Egorov, S [1]
Novichkova, S [1]
Nikitin, A [1]
Mazo, I [1]
Affiliations
[1] Ariadne Genomics Inc, Rockville, MD 20850 USA
DOI
10.1093/bioinformatics/btg452
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: The living cell is a complex machine that depends on the proper functioning of its numerous parts, including proteins. Understanding protein functions and how they modify and regulate each other is the next great challenge for life-sciences researchers. The collective knowledge about protein functions and pathways is scattered throughout numerous publications in scientific journals. Bringing the relevant information together becomes a bottleneck in a research and discovery process. The volume of such information grows exponentially, which renders manual curation impractical. As a viable alternative, automated literature processing tools could be employed to extract and organize biological data into a knowledge base, making it amenable to computational analysis and data mining.
Results: We present MedScan, a completely automated natural language processing-based information extraction system. We have used MedScan to extract 2976 interactions between human proteins from MEDLINE abstracts dated after 1988. The precision of the extracted information was found to be 91%. Comparison with the existing protein interaction databases BIND and DIP revealed that 96% of extracted information is novel. The recall rate of MedScan was found to be 21%. Additional experiments with MedScan suggest that MEDLINE is a unique source of diverse protein function information, which can be extracted in a completely automated way with a reasonably high precision. Further directions of the MedScan technology improvement are discussed.
Pages: 604 / U43
Page count: 31