BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature

Cited by: 30
Authors
Kuo, Cheng-Ju [1]
Ling, Maurice H. T. [3,4]
Lin, Kuan-Ting [1,2]
Hsu, Chun-Nan [1]
Affiliations
[1] Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan
[2] Natl Yang Ming Univ, Inst Biomed Informat, Taipei 112, Taiwan
[3] Singapore Polytech, Sch Chem & Life Sci, Singapore, Singapore
[4] Univ Melbourne, Dept Zool, Parkville, Vic 3052, Australia
Keywords
PROTEIN NAMES; GENE; IDENTIFICATION; DICTIONARY; FORMS
DOI
10.1186/1471-2105-10-S15-S7
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
070307 [Chemical Biology]
Abstract
Background: Text mining tools are becoming essential for automatically processing large quantities of biological literature for knowledge discovery and information curation. Abbreviation recognition is related to named entity recognition (NER) and can be treated as the task of recognizing, in free text, pairs consisting of a term and its corresponding abbreviation. Successfully identifying an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases so that articles of related interest can be retrieved, but also a building block for improving existing gene mention tagging and gene normalization tools.
Results: Our approach to abbreviation recognition (AR) is based on machine learning and exploits a novel set of rich features to learn rules from training data. Tested on the AB3P corpus, our system achieved an F-score of 89.90% with 95.86% precision at 84.64% recall, higher than the result achieved by the best-performing existing AR system. We also annotated a new corpus of 1200 PubMed abstracts derived from the BioCreative II gene normalization corpus. On this annotated corpus, our system achieved an F-score of 86.20% with 93.52% precision at 79.95% recall, which also outperforms all tested systems.
Conclusion: By applying our system to extract all short form-long form pairs from all available PubMed abstracts, we have constructed BIOADI. Mining BIOADI reveals many interesting trends in biomedical research. We also provide offline AR software in the download section at http://bioagent.iis.sinica.edu.tw/BIOADI/.
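The F-scores quoted in the abstract can be checked against the quoted precision and recall. The sketch below assumes that "F-score" means the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` and the corpus labels are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean), in the same units as the inputs.

    Assumption: the abstract's "F-score" is F1; other weightings would differ.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score), all in percent, as quoted in the abstract.
reported = {
    "AB3P corpus": (95.86, 84.64, 89.90),
    "BioCreative II-derived corpus": (93.52, 79.95, 86.20),
}

for name, (p, r, f_reported) in reported.items():
    computed = f_score(p, r)
    print(f"{name}: computed F = {computed:.2f}, reported F = {f_reported:.2f}")
```

Under this assumption, both computed values agree with the reported figures to two decimal places.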
Pages: 10