Comparative experiments on learning information extractors for proteins and their interactions

Cited by: 218
Authors
Bunescu, R
Ge, RF
Kate, RJ
Marcotte, EM
Mooney, RJ [1]
Ramani, AK
Wong, YW
Affiliations
[1] Univ Texas, Dept Comp Sci, Austin, TX 78712 USA
[2] Univ Texas, Inst Cellular & Mol Biol, Austin, TX 78712 USA
[3] Univ Texas, Ctr Computat Biol & Bioinformat, Austin, TX 78712 USA
Funding
National Science Foundation (USA);
Keywords
information extraction; text mining; machine learning; protein interactions; Medline;
DOI
10.1016/j.artmed.2004.07.016
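Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligent science and technology];
Abstract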
Objective: Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. Methods and Material: We used a variety of machine learning methods to automatically develop information extraction systems for extracting information on gene/protein name, function and interactions from Medline abstracts. We present cross-validated results on identifying human proteins and their interactions by training and testing on a set of approximately 1000 manually annotated Medline abstracts that discuss human genes/proteins. Results: We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually developed rules. Conclusion: Our results show that it is promising to use machine learning to automatically build systems for extracting information from biomedical text. The results also give a broad picture of the relative strengths of a wide variety of methods when tested on a reasonably large human-annotated corpus. © 2004 Elsevier B.V. All rights reserved.
Pages: 139-155
Number of pages: 17