Protein names and how to find them

Cited by: 55
Authors
Franzén, K
Eriksson, G
Olsson, F
Asker, L
Lidén, P
Cöster, J
Affiliations
[1] Swedish Inst Comp Sci, SE-16429 Kista, Sweden
[2] Virtual Genet Lab AB, SE-17177 Stockholm, Sweden
Keywords
knowledge; linguistics; natural language processing; medical information science; computational molecular biology; information extraction; protein names
DOI
10.1016/S1386-5056(02)00052-7
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A prerequisite for all higher-level information extraction tasks is the identification of unknown names in text. Today, when large corpora can consist of billions of words, it is of utmost importance to develop accurate techniques for the automatic detection, extraction and categorization of named entities in these corpora. Although named entity recognition might be regarded as a solved problem in some domains, it still poses a significant challenge in others. In this work we focus on one of the more difficult tasks, the identification of protein names in text. This task presents several interesting difficulties because of the named entities' varying structural characteristics, their sometimes unclear status as names, the lack of common standards and fixed nomenclatures, and the specifics of the texts in the molecular biology domain in which they appear. We describe how we approached these and other difficulties in the implementation of Yapex, a system for the automatic identification of protein names in text. We also evaluate Yapex under four different notions of correctness and compare its performance to that of another publicly available system for protein name recognition. (C) 2002 Elsevier Science Ireland Ltd. All rights reserved.
Pages: 49-61
Page count: 13