Creating an online dictionary of abbreviations from MEDLINE

Cited by: 99
Authors
Chang, JT
Schütze, H
Altman, RB
Affiliations
[1] Stanford Univ, Dept Genet, Sch Med, Stanford, CA 94305 USA
[2] Novat Biosci, Stanford, CA USA
DOI
10.1197/jamia.M1139
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective. The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions.

Design. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune.

Measurements. We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database.

Results. On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database.

Conclusion. We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.
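
The scoring idea in the Design and Measurements sections can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two features, the weights in WEIGHTS and BIAS, and all function names are hypothetical, chosen only to show how a logistic function turns features of an (abbreviation, expansion) pair into a score between 0 and 1, and how precision and recall are computed against a hand-annotated set. In the paper, the weights would be learned from the human-annotated training abbreviations rather than set by hand.

```python
import math


def subsequence_fraction(letters, text):
    """Fraction of `letters` found left to right in `text` (greedy match)."""
    pos, matched = 0, 0
    for ch in letters:
        idx = text.find(ch, pos)
        if idx != -1:
            matched += 1
            pos = idx + 1
    return matched / len(letters) if letters else 0.0


def features(abbrev, expansion):
    """Two illustrative features (assumptions, not the paper's feature set)."""
    letters = [c for c in abbrev.lower() if c.isalnum()]
    text = expansion.lower()
    initials = "".join(word[0] for word in text.split())
    return [
        subsequence_fraction(letters, text),      # letters appear in order
        subsequence_fraction(letters, initials),  # letters match word initials
    ]


# Hypothetical hand-set parameters; a real system would fit these
# by logistic regression on annotated abbreviation/expansion pairs.
WEIGHTS = [3.0, 4.0]
BIAS = -4.0


def score(abbrev, expansion):
    """Logistic score in (0, 1) for a candidate expansion."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(abbrev, expansion)))
    return 1.0 / (1.0 + math.exp(-z))


def precision_recall(predicted, gold):
    """Precision and recall of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

With these assumed weights, score("PCR", "polymerase chain reaction") is high while score("PCR", "the sample was then") is low. Keeping only pairs whose score exceeds a threshold and comparing them to the annotated pairs with precision_recall gives one point on a recall/precision trade-off like the one reported in the Results.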
Pages: 612-620
Page count: 9