Using BLAST for identifying gene and protein names in journal articles

Cited by: 109
Authors
Krauthammer, M [1]
Rzhetsky, A
Morozov, P
Friedman, C
Affiliations
[1] Columbia Univ, Dept Med Informat, New York, NY 10027 USA
[2] Columbia Univ, Columbia Genome Ctr, New York, NY 10027 USA
[3] CUNY, Queens Coll, Dept Comp Sci, New York, NY USA
Keywords
natural language processing; regulatory pathways; sequence comparison tools; string matching
DOI
10.1016/S0378-1119(00)00431-5
Chinese Library Classification number
Q3 [Genetics];
Subject classification code
071007; 090102;
Abstract
We describe a system that automatically identifies gene and protein names in journal articles, an important and non-trivial first step in extracting knowledge about protein and gene actions. Our system uses a database of gene and protein names and is based on BLAST [Altschul et al., Nucleic Acids Res. 25 (1997) 3389-3402], a popular tool for DNA and protein sequence comparison. We describe a method that maps sequences of text characters into sequences of nucleotides that can be processed by BLAST. We demonstrate that this approach is feasible: the system matches gene and protein names with a recall of 78.8% and a precision of 71.7%, which includes names that are not part of the system database. An analysis of the results suggests techniques that can be used to improve performance further. (C) 2000 Elsevier Science B.V. All rights reserved.
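The character-to-nucleotide mapping mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual encoding, so everything below is an assumption made for illustration: the supported character set, the fixed codeword length of three nucleotides, and the function names `text_to_nucleotides` and `nucleotides_to_text` are all hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical character set: lowercase letters, digits, and a few symbols
# common in gene and protein names (39 symbols in total).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 -/"

# Assumed codeword length: 4**3 = 64 distinct codewords, enough for 39 symbols.
CODE_LEN = 3
CODEWORDS = ["".join(p) for p in product("ACGT", repeat=CODE_LEN)]

# One unique codeword per character, and the inverse table for decoding.
ENCODE = dict(zip(ALPHABET, CODEWORDS))
DECODE = {code: ch for ch, code in ENCODE.items()}


def text_to_nucleotides(text):
    """Translate text into a nucleotide string, skipping unsupported characters."""
    return "".join(ENCODE[ch] for ch in text.lower() if ch in ENCODE)


def nucleotides_to_text(seq):
    """Translate a nucleotide string produced by text_to_nucleotides back to text."""
    return "".join(
        DECODE[seq[i:i + CODE_LEN]] for i in range(0, len(seq), CODE_LEN)
    )


if __name__ == "__main__":
    encoded = text_to_nucleotides("IL-2 receptor")
    print(encoded)
    print(nucleotides_to_text(encoded))
```

Under such a scheme, both the name database and the article text would be translated the same way, so that a sequence comparison tool such as BLAST could align them as if they were DNA. A fixed codeword length keeps decoding trivial; whether the original system used fixed-length codes is not stated in the abstract.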
Pages: 245-252
Page count: 8
Related papers
15 in total
[1]   Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [J].
Altschul, SF ;
Madden, TL ;
Schaffer, AA ;
Zhang, JH ;
Zhang, Z ;
Miller, W ;
Lipman, DJ .
NUCLEIC ACIDS RESEARCH, 1997, 25 (17) :3389-3402
[2]  
ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
[3]   GenBank [J].
Benson, DA ;
Boguski, MS ;
Lipman, DJ ;
Ostell, J ;
Ouellette, BFF .
NUCLEIC ACIDS RESEARCH, 1998, 26 (01) :1-7
[4]  
CHRISTIANSEN T, 1998, PERL COOKBOOK
[5]  
CRICK FH, 1957, P NATL ACAD SCI USA, P416
[6]  
Friedman C, 1997, J AM MED INFORM ASSN, P595
[7]  
Fukuda K, 1998, Pac Symp Biocomput, P707
[8]  
Gusfield D, 1997, ALGORITHMS STRINGS T
[9]   UNLOCKING CLINICAL-DATA FROM NARRATIVE REPORTS - A STUDY OF NATURAL-LANGUAGE PROCESSING [J].
HRIPCSAK, G ;
FRIEDMAN, C ;
ALDERSON, PO ;
DUMOUCHEL, W ;
JOHNSON, SB ;
CLAYTON, PD .
ANNALS OF INTERNAL MEDICINE, 1995, 122 (09) :681-688
[10]  
Nei M., 1987, Science, Philosophy and Human Behavior in the Soviet Union