Information gain and divergence-based feature selection for machine learning-based text categorization

Cited by: 266
Authors
Lee, CK [1 ]
Lee, GG [1 ]
Affiliations
[1] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Pohang 790784, South Korea
Keywords
text categorization; feature selection; information gain and divergence-based feature selection;
DOI
10.1016/j.ipm.2004.08.006
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Most previous work on feature selection has emphasized only reducing the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, with our feature selection method, conventional machine learning algorithms sometimes improve enough to surpass support vector machines, which are known to give the best classification accuracy. (c) 2004 Elsevier Ltd. All rights reserved.
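To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes conventional information gain for a term's presence or absence, then greedily picks terms by information gain minus a redundancy penalty. The abstract does not give the paper's scoring formula, so the penalty used here (mutual information between the presence indicators of two terms) and the names `term_redundancy`, `select_features`, and `redundancy_weight` are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Conventional IG: H(labels) - H(labels | term present or absent)."""
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:
            conditional += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - conditional


def term_redundancy(docs, term, other):
    """Assumed stand-in for the paper's divergence measure:
    mutual information between the presence indicators of two terms."""
    other_presence = [other in d for d in docs]
    return information_gain(docs, other_presence, term)


def select_features(docs, labels, k, redundancy_weight=0.5):
    """Greedy selection: IG minus weighted max redundancy with chosen terms.
    redundancy_weight is a hypothetical trade-off parameter."""
    vocab = sorted(set().union(*docs))
    ig = {t: information_gain(docs, labels, t) for t in vocab}
    selected = []
    while len(selected) < min(k, len(vocab)):
        def score(t):
            redundancy = max(
                (term_redundancy(docs, t, s) for s in selected), default=0.0
            )
            return ig[t] - redundancy_weight * redundancy

        best = max((t for t in vocab if t not in selected), key=score)
        selected.append(best)
    return selected


# Toy corpus: each document is a set of terms; "ball" and "bat" always co-occur.
docs = [
    {"ball", "bat"}, {"ball", "bat"},
    {"vote"}, {"vote"},
    {"chip"}, {"chip"},
]
labels = ["sport", "sport", "politics", "politics", "tech", "tech"]

print(round(information_gain(docs, labels, "ball"), 3))  # 0.918
print(select_features(docs, labels, k=2))  # ['ball', 'chip']
```

In this toy corpus all four terms have the same information gain, so ranking by information gain alone cannot tell "bat" apart from "chip". The redundancy penalty lowers the score of "bat" once "ball" is selected, because the two always co-occur, and a term covering a different class is chosen instead.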
Pages: 155-165
Number of pages: 11
Related papers
15 in total
[1]  
[Anonymous], BOW TOOLKIT STAT LAN
[2]  
COOPER WS, 1991, P 14 ACM SIGIR INT C
[3]  
Goldstein Jad, 1998, P 21 ACM SIGIR INT C
[4]  
JOACHIMS T, 1997, P ICML 97 14 INT C M
[5]  
JOACHIMS T, 2001, P 24 ACM SIGIR INT C
[6]  
Joachims Thorsten, 1998, P ECML 98 10 EUR C M, P137
[7]  
Koller D., 1996, P ICML 96 13 INT C M
[8]  
LEWIS DD, 1994, P SDAIR 94 3 ANN S D
[9]  
McCallum Andrew, 1998, AAAI 1998
[10]  
PIETRA SD, 1997, IEEE T PATTERN ANAL