Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

Cited by: 66
Authors
Shatkay, Hagit [1]
Pan, Fengxia [1]
Rzhetsky, Andrey [2, 3, 4]
Wilbur, W. John [5]
Affiliations
[1] Queens Univ, Sch Comp, Computat Biol & Machine Learning Lab, Kingston, ON, Canada
[2] Univ Chicago, Dept Med, Chicago, IL 60637 USA
[3] Univ Chicago, Computat Inst, Dept Human Genet, Chicago, IL 60637 USA
[4] Univ Chicago, Inst Genom & Syst Biol, Chicago, IL 60637 USA
[5] NIH, Natl Lib Med, Natl Ctr Biotechnol Informat, Bethesda, MD 20892 USA
Funding
Natural Sciences and Engineering Research Council of Canada
DOI
10.1093/bioinformatics/btn381
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no average biologist client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD), database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective if they first identify text regions rich in the type of scientific content that interests the user, retrieve documents that contain many such regions, and focus fact extraction on these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks.
Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, with each sentence tagged independently by three of them. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice.
Pages: 2086-2093
Number of pages: 8