ProMiner: rule-based protein and gene entity recognition

Cited by: 195
Authors
Hanisch, D
Fundel, K
Mevissen, HT
Zimmer, R
Fluck, J
Affiliations
[1] Fraunhofer Inst SCAI, D-53754 St Augustin, Germany
[2] Univ Munich, Inst Informat, D-80333 Munich, Germany
Keywords
Acceptance Score; Spelling Variant; Token Class; False Positive Match; Approximate Search
DOI
10.1186/1471-2105-6-S1-S14
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Identification of gene and protein names in biomedical text is a challenging task as the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data.

Methods: The ProMiner system uses a pre-processed synonym dictionary to identify potential name occurrences in the biomedical text and associate protein and gene database identifiers with the detected matches. It follows a rule-based approach and its search algorithm is geared towards recognition of multi-word names [1]. To account for the large number of ambiguous synonyms in the considered organisms, the system has been extended to use specific variants of the detection procedure for highly ambiguous and case-sensitive synonyms. Based on all detected synonyms for one abstract, the most plausible database identifiers are associated with the text. Organism specificity is addressed by a simple procedure based on additionally detected organism names in an abstract.

Results: The extended ProMiner system has been applied to the test cases of the BioCreAtIvE competition with highly encouraging results. In blind predictions, the system achieved an F-measure of approximately 0.8 for the organisms mouse and fly and about 0.9 for the organism yeast.
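The dictionary lookup described under Methods can be illustrated with a minimal sketch. This is not ProMiner's algorithm, which the abstract describes only at a high level. It is an assumed simplification that uses exact greedy longest-match lookup over tokens and a separate case-sensitive index for short symbols. The identifiers, synonyms, and function names (`tokenize`, `build_index`, `find_matches`) are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into alphanumeric tokens, so 'TNF-alpha' -> ['TNF', 'alpha']."""
    return re.findall(r"[A-Za-z0-9]+", text)

def build_index(synonyms, case_sensitive):
    """Map token tuples to the identifiers that list them as a synonym.

    Synonyms named in `case_sensitive` are stored verbatim; all others
    are stored lowercased so they match regardless of case.
    """
    exact, folded = defaultdict(set), defaultdict(set)
    for identifier, names in synonyms.items():
        for name in names:
            tokens = tuple(tokenize(name))
            if name in case_sensitive:
                exact[tokens].add(identifier)
            else:
                folded[tuple(t.lower() for t in tokens)].add(identifier)
    return exact, folded

def find_matches(text, exact, folded):
    """Greedy longest-match scan; returns (matched text, sorted identifiers)."""
    tokens = tokenize(text)
    max_len = max((len(k) for k in list(exact) + list(folded)), default=0)
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest window first so multi-word names win over their parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + n])
            ids = exact.get(window) or folded.get(tuple(t.lower() for t in window))
            if ids:
                matches.append((" ".join(window), sorted(ids)))
                i += n
                break
        else:
            i += 1  # no synonym starts at this token
    return matches

# Hypothetical toy dictionary: identifier -> synonyms (assumption, not real data).
SYNONYMS = {"ID1": ["tumor necrosis factor alpha", "TNF"]}
CASE_SENSITIVE = {"TNF"}  # assumed: short symbols must match case exactly

exact_index, folded_index = build_index(SYNONYMS, CASE_SENSITIVE)
hits = find_matches("Tumor necrosis factor alpha (TNF) differs from tnf.",
                    exact_index, folded_index)
```

Here the multi-word name and the upper-case symbol are both mapped to `ID1`, while the lower-case `tnf` is skipped because the symbol is treated as case-sensitive. A fuller system would add approximate matching, disambiguation across all matches in an abstract, and organism filtering, none of which this sketch attempts.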
Pages: 9
Related papers
17 entries in total
[1]  
[Anonymous], 1998, GENOME INFORM
[2]  
[Anonymous], P COLING
[3]   Creating an online dictionary of abbreviations from MEDLINE [J].
Chang, JT ;
Schütze, H ;
Altman, RB .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2002, 9 (06) :612-620
[4]   Data preparation and interannotator agreement: BioCreAtIvE task IB [J].
Colosimo, ME ;
Morgan, AA ;
Yeh, AS ;
Colombe, JB ;
Hirschman, L .
BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
[5]   Automatically annotating documents with normalized gene lists [J].
Crim, J ;
McDonald, R ;
Pereira, F .
BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
[6]  
FUKADA K, 1998, PAC S BIOC, P701
[7]   A simple approach for protein name identification: prospects and limits [J].
Fundel, K ;
Güttler, D ;
Zimmer, R ;
Apostolakis, J .
BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
[8]  
Hanisch Daniel, 2003, Pac Symp Biocomput, P403
[9]   Overview of BioCreAtIvE task IB: normalized gene lists [J].
Hirschman, L ;
Colosimo, M ;
Morgan, A ;
Yeh, A .
BMC BIOINFORMATICS, 2005, 6 (Suppl 1)
[10]   A literature network of human genes for high-throughput analysis of gene expression [J].
Jenssen, TK ;
Lægreid, A ;
Komorowski, J ;
Hovig, E .
NATURE GENETICS, 2001, 28 (01) :21-+