GAPSCORE: finding gene and protein names one word at a time

Cited by: 37
Authors
Chang, JT
Schütze, H
Altman, RB
Affiliations
[1] Stanford Univ, Med Ctr, Dept Genet, Stanford, CA 94305 USA
[2] Enkata Technol, San Mateo, CA 94403 USA
DOI
10.1093/bioinformatics/btg393
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification code
071010 [Biochemistry and Molecular Biology]; 081704 [Applied Chemistry]
Abstract
Motivation: New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context.
Results: We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs.
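As an illustration of the evaluation figures in the abstract, the sketch below computes the balanced F-score (the harmonic mean of precision and recall) and shows how moving a score cutoff trades precision against recall. It is not GAPSCORE itself: the function names and the toy scored words are hypothetical, and the abstract does not say which F-score weighting was used, so the balanced form is an assumption. Recomputing from the rounded precision and recall gives roughly 82.4% for partial matches, slightly below the reported 82.5%, presumably because the published inputs are rounded.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (assumed weighting)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_at_cutoff(scored_words, cutoff):
    """Precision and recall when every word scoring >= cutoff is called a name.

    scored_words is a list of (score, is_name) pairs; both the scores and the
    labels here are hypothetical, not output of the actual GAPSCORE model.
    """
    predicted = [is_name for score, is_name in scored_words if score >= cutoff]
    true_positives = sum(predicted)
    total_names = sum(is_name for _, is_name in scored_words)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_names if total_names else 0.0
    return precision, recall


# Rounded figures from the abstract (precision, recall).
partial = f_score(0.815, 0.833)
exact = f_score(0.567, 0.585)
print(f"partial matches: {partial:.3f}, exact matches: {exact:.3f}")

# Toy data: a higher cutoff raises precision and lowers recall.
toy = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]
for cutoff in (0.3, 0.5, 0.9):
    p, r = precision_recall_at_cutoff(toy, cutoff)
    print(f"cutoff {cutoff}: precision {p:.2f}, recall {r:.2f}, F {f_score(p, r):.2f}")
```

A user who needs few false positives would pick a high cutoff, and one who needs broad coverage would pick a low one; the F-score summarizes the trade-off at any single cutoff.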
Pages: 216-225
Page count: 10