BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature

Cited by: 30

Authors:

Kuo, Cheng-Ju [1]

Ling, Maurice H. T. [3,4]

Lin, Kuan-Ting [1,2]

Hsu, Chun-Nan [1]

Affiliations:

[1] Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan

[2] Natl Yang Ming Univ, Inst Biomed Informat, Taipei 112, Taiwan

[3] Singapore Polytech, Sch Chem & Life Sci, Singapore, Singapore

[4] Univ Melbourne, Dept Zool, Parkville, Vic 3052, Australia

Source:

BMC Bioinformatics | 2009 / Volume 10

Keywords:

PROTEIN NAMES; GENE; IDENTIFICATION; DICTIONARY; FORMS;

DOI:

10.1186/1471-2105-10-S15-S7

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Discipline classification code:

070307 [Chemical Biology]

Abstract:

Background: Text mining tools are becoming essential for automatically processing large quantities of biological literature for knowledge discovery and information curation. Abbreviation recognition (AR) is related to named entity recognition (NER) and can be treated as the task of recognizing pairs of a term and its corresponding abbreviation in free text. Successfully identifying an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases to retrieve articles of related interest, but also a building block for improving existing gene mention tagging and gene normalization tools. Results: Our approach to AR is based on machine learning and exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On this corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems. Conclusion: By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide offline AR software in the download section at http://bioagent.iis.sinica.edu.tw/BIOADI/.
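The reported F-scores can be cross-checked against the stated precision and recall. The sketch below assumes the abstract's "F-score" is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is a hypothetical helper, not part of the BIOADI software.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages as reported in the abstract.
ab3p = f_score(95.86, 84.64)        # AB3P corpus
new_corpus = f_score(93.52, 79.95)  # 1200-abstract corpus

print(f"AB3P corpus: {ab3p:.2f}%")
print(f"Annotated corpus: {new_corpus:.2f}%")
```

Under this assumption, both values agree with the reported 89.90% and 86.20% to two decimal places.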

Pages: 10
