AN EXAMPLE-BASED MAPPING METHOD FOR TEXT CATEGORIZATION AND RETRIEVAL

Cited by: 206
Authors
YANG, YM
CHUTE, CG
Affiliation
[1] Section of Medical Information Resources, Mayo Clinic/Foundation, Rochester
Keywords
DOCUMENT CATEGORIZATION; QUERY CATEGORIZATION; STATISTICAL LEARNING OF HUMAN DECISIONS
DOI
10.1145/183422.183424
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used for studies on the effectiveness of our approach, and on how much the effectiveness depends on the choices of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. It is evident that the LLSF approach uses the relevance information effectively within human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval, compared to alternative approaches.
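The mapping described in the abstract can be illustrated with a minimal sketch. It assumes the least squares fit is posed as finding a matrix F that minimizes the Frobenius norm of F A - B, where A is a word-by-document matrix of training texts and B is a category-by-document matrix of the human-assigned categories. The vocabulary, categories, training documents, and the helper names `fit_llsf` and `rank_categories` are all hypothetical and chosen only for illustration; the pseudoinverse from NumPy stands in for whatever numerical routine the authors used, and binary word weights stand in for their weighting schemes.

```python
import numpy as np

# Hypothetical toy vocabulary and category set (not from the paper).
VOCAB = ["heart", "attack", "lung", "cancer"]
CATEGORIES = ["cardiology", "oncology"]


def fit_llsf(A, B):
    """Return F minimizing ||F A - B||_F.

    A: (n_words, n_docs) word weights of the training documents.
    B: (n_categories, n_docs) human-assigned category indicators.
    The minimum-norm solution is F = B A^+, with A^+ the pseudoinverse.
    """
    return B @ np.linalg.pinv(A)


def rank_categories(F, x, names):
    """Score a new word vector x against every category, best first."""
    scores = F @ x
    order = np.argsort(-scores)
    return [(names[i], float(scores[i])) for i in order]


# Three training documents as columns: "heart attack", "lung cancer", "heart".
A = np.array([
    [1.0, 0.0, 1.0],  # heart
    [1.0, 0.0, 0.0],  # attack
    [0.0, 1.0, 0.0],  # lung
    [0.0, 1.0, 0.0],  # cancer
])
# Manual categorization of the same three documents.
B = np.array([
    [1.0, 0.0, 1.0],  # cardiology
    [0.0, 1.0, 0.0],  # oncology
])

F = fit_llsf(A, B)

# Map an unseen text containing only the word "lung" to ranked categories.
query = np.array([0.0, 0.0, 1.0, 0.0])
ranking = rank_categories(F, query, CATEGORIES)
print(ranking)
```

The same sketch covers the retrieval case in the abstract if the columns of A hold training queries and the columns of B hold the indexing terms of their related documents; only the meaning of the rows changes.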
Pages: 252-277
Page count: 26
Related papers
22 in total
[1]  
CHUTE CG, 1992, 16TH P ANN S COMP AP, V16, P639
[2]  
DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[4]  
Dongarra J. J., 1979, LINPACK USERS GUIDE
[5]  
EVANS DA, 1992, MEDINFO, V92, P1462
[6]  
EVANS DA, 1991, MED DECISION MAKIN S, V11, P108
[7]   A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING [J].
FUHR, N ;
BUCKLEY, C .
ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1991, 9 (03) :223-248
[8]  
FUHR N, 1991, P RIAO 91, P606
[9]  
Golub G.H., 1996, MATH GAZ, VThird
[10]   ONLINE ACCESS TO MEDLINE IN CLINICAL SETTINGS - A STUDY OF USE AND USEFULNESS [J].
HAYNES, RB ;
MCKIBBON, KA ;
WALKER, CJ ;
RYAN, N ;
FITZGERALD, D ;
RAMSDEN, MF .
ANNALS OF INTERNAL MEDICINE, 1990, 112 (01) :78-84