Machine learning with naturally labeled data for identifying abbreviation definitions

Cited by: 9
Authors
Yeganova, Lana [1]
Comeau, Donald C. [1]
Wilbur, W. John [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20892 USA
Keywords
Short Form; Familial Mediterranean Fever; Germinal Vesicle; Feature Weight; Long Form;
DOI
10.1186/1471-2105-12-S3-S6
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Background: The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches to abbreviation definition identification employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data.

Methods: In this work, we develop a machine learning algorithm for abbreviation definition identification in text that makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. The learned feature weights are then used to identify the abbreviation full form. This approach does not require manually labeled training data.

Results: We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on the Ab3P corpus and an F-measure of 87.13% on the BIOADI corpus, both of which are superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
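The construction of naturally labeled data described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the parenthesis pattern `PAREN`, the `max_words` window, and the function names `candidate_pairs` and `make_negatives` are all assumptions made for illustration. The sketch covers only how positive and negative examples might be assembled, not the learner or the use of feature weights to locate the full form.

```python
import random
import re

# Assumed pattern (not from the paper): a parenthesized token of 2-10
# characters, starting with a letter or digit, is a potential short form.
PAREN = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9\-]{1,9})\)")


def candidate_pairs(sentence, max_words=8):
    """Return (short_form, preceding_text) pairs for each parenthesized token.

    The preceding text is the last `max_words` words before the parenthesis;
    it is a potential definition, not a confirmed one.
    """
    pairs = []
    for match in PAREN.finditer(sentence):
        short_form = match.group(1)
        words = sentence[:match.start()].split()
        window = " ".join(words[-max_words:])
        if window:
            pairs.append((short_form, window))
    return pairs


def make_negatives(positives, rng):
    """Pair each short form with the potential definition of a different example."""
    n = len(positives)
    if n < 2:
        return []  # nothing unrelated to mix with
    negatives = []
    for i, (short_form, _) in enumerate(positives):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # skip the example's own context
        negatives.append((short_form, positives[j][1]))
    return negatives


sentences = [
    "Patients with familial Mediterranean fever (FMF) were studied.",
    "The germinal vesicle (GV) breaks down during maturation.",
]
positives = [pair for s in sentences for pair in candidate_pairs(s)]
negatives = make_negatives(positives, random.Random(0))
```

Because the labels come from co-occurrence in text rather than from annotators, a classifier trained on `positives` versus `negatives` needs no manually labeled data, which is the property the abstract emphasizes.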
Pages: 8