Resources for comparing the speed and performance of medical autocoders

Cited by: 2
Authors
Berman J.J. [1]
Affiliations
[1] Cancer Diagnosis Program, National Cancer Institute, National Institutes of Health, Bethesda, MD
Keywords
Machine Translation; Unified Medical Language System; Medical Text; Unique Word; Concept Index
DOI
10.1186/1472-6947-4-8
Abstract
Background: Concept indexing is a popular method for characterizing medical text, and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medical concept to all equivalent representations. A concept search on the term renal cell carcinoma would be expected to also find occurrences of its concept equivalents, hypernephroma and renal carcinoma. The purpose of this study is to provide freely available resources for comparing speed and performance among different autocoders. These resources consist of: 1) a public domain autocoder written in Perl (a free and open source programming language that installs on any operating system); 2) a nomenclature database derived from the unencumbered subset of the publicly available Unified Medical Language System; 3) a large corpus of autocoded output derived from a publicly available medical text.
Methods: A simple lexical autocoder was written that parses plain text into a list of all one-, two-, three-, and four-word strings contained in the text, assigning a nomenclature code to each text string that matches a term in the nomenclature. The nomenclature used is the unencumbered subset of the 2003 Unified Medical Language System (UMLS). This subset was reduced to exclude homonymous one-word terms and proper names, resulting in a term/code data dictionary containing about half a million medical terms. The Online Mendelian Inheritance in Man (OMIM), a 92+ megabyte publicly available medical opus, was used as sample medical text for the autocoder.
Results: The autocoding Perl script is remarkably short, consisting of just 38 command lines. The 92+ megabyte OMIM file was completely autocoded in 869 seconds on a 2.4 GHz processor (less than 10 seconds per megabyte of text). The autocoded output file (9,540,442 bytes) contains 367,963 coded terms from OMIM and is distributed with this manuscript.
Conclusions: A public domain Perl script is provided that can parse through plaintext files of any length, matching concepts against an external nomenclature. The script and associated files can be used freely to compare the speed and performance of autocoding software.
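The matching step described under Methods can be illustrated with a minimal Python sketch. This is not the author's Perl script: the function name `autocode`, the toy dictionary `NOMENCLATURE`, and the concept codes in it are hypothetical, and splitting the text at sentence punctuation before forming word strings is an assumption that the abstract does not state.

```python
import re

# Hypothetical toy nomenclature mapping each term to a concept code.
# The codes are invented for illustration; they are not real UMLS identifiers.
NOMENCLATURE = {
    "renal cell carcinoma": "C0000001",
    "hypernephroma": "C0000001",
    "renal carcinoma": "C0000001",
    "adenocarcinoma": "C0000002",
}

MAX_WORDS = 4  # the abstract describes 1-, 2-, 3-, and 4-word strings


def autocode(text, nomenclature=NOMENCLATURE, max_words=MAX_WORDS):
    """Return (term, code) pairs for every word string that matches a term."""
    matches = []
    # Assumption: word strings do not cross sentence-like boundaries.
    for sentence in re.split(r"[.;:!?\n]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for start in range(len(words)):
            for length in range(1, max_words + 1):
                if start + length > len(words):
                    break
                phrase = " ".join(words[start:start + length])
                code = nomenclature.get(phrase)
                if code is not None:
                    matches.append((phrase, code))
    return matches


if __name__ == "__main__":
    sample = "The biopsy showed hypernephroma. Renal cell carcinoma was confirmed."
    for term, code in autocode(sample):
        print(code, term)
```

Because equivalent terms share one code in the dictionary, a search on that code retrieves both the hypernephroma and the renal cell carcinoma occurrences, which is the behavior the Background section expects of concept indexing. Each candidate string costs one dictionary lookup, so the run time of this sketch grows roughly linearly with the length of the text.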