Quantifying the Impact and Extent of Undocumented Biomedical Synonymy

Cited by: 6
Authors
Blair, David R. [1, 2]
Wang, Kanix [1, 2]
Nestorov, Svetlozar [3]
Evans, James A. [3, 4]
Rzhetsky, Andrey [1, 2, 3, 5, 6]
Affiliations
[1] Univ Chicago, Inst Genom & Syst Biol, Chicago, IL 60637 USA
[2] Univ Chicago, Comm Genet Genom & Syst Biol, Chicago, IL 60637 USA
[3] Univ Chicago, Computat Inst, Chicago, IL 60637 USA
[4] Univ Chicago, Dept Sociol, Chicago, IL 60637 USA
[5] Univ Chicago, Dept Med, Chicago, IL 60637 USA
[6] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
Keywords
GENOME-WIDE ASSOCIATION; FUNCTIONAL NETWORK; NUMBER; EXTRACTION; GENES; UMLS; ONTOLOGIES; RESOURCES; VARIANTS; STANDARD;
DOI
10.1371/journal.pcbi.1003799
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
070307 [Chemical Biology];
Abstract
Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.
Pages: 17