DNorm: disease name normalization with pairwise learning to rank

Citations: 408
Authors
Leaman, Robert [1, 2]
Dogan, Rezarta Islamaj [1]
Lu, Zhiyong [1]
Affiliations
[1] Natl Ctr Biotechnol Informat, Bethesda, MD 20894 USA
[2] Arizona State Univ, Dept Biomed Informat, Scottsdale, AZ 85259 USA
Funding
National Institutes of Health (USA);
Keywords
TERMS; TASK; TEXT;
DOI
10.1093/bioinformatics/btt474
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) than attempts at other normalization tasks in biomedical text mining research.
Methods: In this article, we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.
Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.
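As an illustration of the pairwise learning-to-rank idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the scoring function, the features, or the update rule, so everything here is an assumption made for illustration: a bilinear score over toy bag-of-words vectors, a margin-based update, and the hypothetical names `score` and `train_pairwise`. The toy vocabulary and disease names are likewise invented.

```python
# Sketch of pairwise learning to rank for name normalization.
# Assumptions (not from the record): bilinear score m^T W n over
# bag-of-words vectors, margin-based update, toy 4-word vocabulary.

def score(W, m, n):
    """Bilinear similarity between mention vector m and name vector n."""
    dim = len(m)
    return sum(m[i] * W[i][j] * n[j] for i in range(dim) for j in range(dim))


def train_pairwise(triples, dim, lr=0.1, epochs=20, margin=1.0):
    """Learn W from (mention, correct_name, wrong_name) vector triples."""
    # Start from the identity so the initial score is a plain dot product.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        for m, pos, neg in triples:
            # Update only when the correct name does not beat the wrong
            # name by at least the margin.
            if score(W, m, pos) - score(W, m, neg) < margin:
                for i in range(dim):
                    for j in range(dim):
                        W[i][j] += lr * m[i] * (pos[j] - neg[j])
    return W


# Toy vocabulary order: [cancer, tumor, breast, brain]
mention = [0.0, 1.0, 1.0, 0.0]  # "breast tumor"
correct = [1.0, 0.0, 1.0, 0.0]  # "breast cancer"
wrong = [0.0, 1.0, 0.0, 1.0]    # "brain tumor"

W = train_pairwise([(mention, correct, wrong)], dim=4)
print(score(W, mention, correct) > score(W, mention, wrong))
```

Before training, both candidate names share exactly one token with the mention and score the same under a plain dot product. The pairwise update lets the model learn from the training pair that, in this toy case, "tumor" in a mention should also count toward "cancer" in a concept name, which is the kind of similarity a purely lexical matcher cannot capture.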
Pages: 2909-2917
Page count: 9