Rough set-aided keyword reduction for text categorization

Cited by: 204
Authors
Chouchoulas, A [1]
Shen, Q [1]
Affiliation
[1] Univ Edinburgh, Div Informat, Inst Representat & Reasoning, Edinburgh EH1 1QN, Midlothian, Scotland
DOI
10.1080/088395101753210773
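Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 [Pattern Recognition and Intelligent Systems]; 0812 [Computer Science and Technology]; 0835 [Software Engineering]; 1405 [Intelligence Science and Technology];
Abstract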
The volume of electronically stored information increases exponentially as the state of the art progresses. Automated information filtering (IF) and information retrieval (IR) systems are therefore acquiring rapidly increasing prominence. However, such systems sacrifice efficiency to boost effectiveness. They typically have to cope with sets of vectors of many tens of thousands of dimensions. Rough set (RS) theory can be applied to reducing the dimensionality of data used in IF/IR tasks, by providing a measure of the information content of datasets with respect to a given classification. This can aid IF/IR systems that rely on the acquisition of large numbers of term weights or other measures of relevance. This article investigates the applicability of RS theory to the IF/IR application domain and compares this applicability with respect to various existing text categorization (TC) techniques. The ability of the approach to generalize, given a minimum of training data, is also addressed. The background of RS theory is presented, with an illustrative example to demonstrate the operation of the RS-based dimensionality reduction. A modular system is proposed which allows the integration of this technique with a large variety of different IF/IR approaches. The example application, categorization of E-mail messages, is described. Systematic experiments and their results are reported and analyzed.
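The idea of measuring how much a set of keywords tells us about a classification can be illustrated with the standard rough-set dependency degree, followed by a greedy search for a small keyword subset that preserves it. The following is only a minimal sketch under assumptions: the function names `dependency` and `greedy_reduct`, the toy keyword columns, and the greedy selection strategy are illustrative choices, not taken from the article, which may use a different procedure.

```python
from collections import defaultdict

def dependency(rows, labels, attrs):
    """Rough-set dependency degree: the fraction of objects whose values
    on `attrs` fall in a group where every object has the same label."""
    seen_labels = defaultdict(set)
    sizes = defaultdict(int)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        seen_labels[key].add(label)
        sizes[key] += 1
    consistent = sum(sizes[k] for k, s in seen_labels.items() if len(s) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, labels):
    """Greedily add the attribute that raises the dependency degree most,
    until it matches the degree obtained with all attributes."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max((a for a in all_attrs if a not in chosen),
                   key=lambda a: dependency(rows, labels, chosen + [a]))
        chosen.append(best)
    return chosen

# Hypothetical toy data: columns are presence of "meeting", "offer", "the".
rows = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 1, 0)]
labels = ["work", "work", "spam", "spam"]
reduct = greedy_reduct(rows, labels)
```

In this toy dataset the third keyword appears in both categories and so carries no information about the label, while the first keyword alone separates the two categories, so the greedy search keeps a single column.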
Pages: 843-873
Number of pages: 31