A comparison study on algorithms of detecting long forms for short forms in biomedical text

Cited by: 16
Authors
Torii, Manabu [1]
Hu, Zhang-zhi [2]
Song, Min [3]
Wu, Cathy H. [2]
Liu, Hongfang [1]
Affiliations
[1] Georgetown Univ, Med Ctr, Dept Biostat Bioinformat & Biomath, Washington, DC 20057 USA
[2] Georgetown Univ, Med Ctr, Dept Biochem & Mol & Cell Biol, Washington, DC 20007 USA
[3] Univ Hts, New Jersey Inst Technol, Dept Informat Syst, Newark, NJ 07102 USA
DOI
10.1186/1471-2105-8-S9-S5
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: As research on literature mining in the biomedical domain grows, more and more systems are available to choose from when building literature mining applications. In this study, we focus on one specific literature mining task: detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We refer to acronyms, abbreviations, and symbols as short forms (SFs) and to their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: i) how well a system performs in detecting LFs in novel text, ii) how well various terminological knowledge bases cover SFs as synonyms of their LFs, and iii) how to combine results from various SF knowledge bases.
Method: We evaluated three publicly available systems for detecting LFs for SFs: i) ALICE, a handcrafted pattern/rule-based system by Ao and Takagi, ii) a machine learning system by Chang et al., and iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i) the UMLS (Unified Medical Language System) and ii) BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides virtual integration of various SF knowledge bases.
Results: We found that the detection systems agree with each other in most cases, and that the existing terminological knowledge bases have good coverage of synonymy relationships for frequently defined LFs. The web interface allows users to detect SF definitions in text and to search several SF knowledge bases.
Pages: 9
Related papers
23 in total
  • [1] SaRAD: a simple and robust abbreviation dictionary
    Adar, E
    [J]. BIOINFORMATICS, 2004, 20 (04) : 527 - 533
  • [2] ALICE: An algorithm to extract abbreviations from MEDLINE
    Ao, H
    Takagi, TI
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2005, 12 (05) : 576 - 586
  • [3] The universal protein resource (UniProt)
    Bairoch, A
    Apweiler, R
    Wu, CH
    Barker, WC
    Boeckmann, B
    Ferro, S
    Gasteiger, E
    Huang, HZ
    Lopez, R
    Magrane, M
    Martin, MJ
    Natale, DA
    O'Donovan, C
    Redaschi, N
    Yeh, LSL
    [J]. NUCLEIC ACIDS RESEARCH, 2005, 33 : D154 - D159
  • [4] Bloom DA, 2000, BJU INT, V86, P1
  • [5] The Unified Medical Language System (UMLS): integrating biomedical terminology
    Bodenreider, O
    [J]. NUCLEIC ACIDS RESEARCH, 2004, 32 : D267 - D270
  • [6] Creating an online dictionary of abbreviations from MEDLINE
    Chang, JT
    Schütze, H
    Altman, RB
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2002, 9 (06) : 612 - 620
  • [7] EYRE TA, 2006, NUCLEIC ACIDS RES, pE319
  • [8] Rutabaga by any other name: extracting biological names
    Hirschman, L
    Morgan, AA
    Yeh, AS
    [J]. JOURNAL OF BIOMEDICAL INFORMATICS, 2002, 35 (04) : 247 - 259
  • [9] Biomedical language processing: What's beyond PubMed?
    Hunter, L
    Cohen, KB
    [J]. MOLECULAR CELL, 2006, 21 (05) : 589 - 594
  • [10] Disambiguating ambiguous biomedical terms in biomedical narrative text: An unsupervised method
    Liu, HF
    Lussier, YA
    Friedman, C
    [J]. JOURNAL OF BIOMEDICAL INFORMATICS, 2001, 34 (04) : 249 - 261