Feature selection with a measure of deviations from Poisson in text categorization

Cited: 67
Authors
Ogura, Hiroshi [1 ]
Amano, Hiromi [1 ]
Kondo, Masato [1 ]
Affiliation
[1] Showa Univ, Fac Arts & Sci, Dept Informat Sci, Fujiyoshida, Yamanashi 4030005, Japan
Keywords
Text categorization; Feature selection; Poisson distribution; Support vector machine; k-NN classifier; Algorithm
DOI
10.1016/j.eswa.2008.08.006
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
To improve the performance of automatic text classification, it is desirable to reduce the high dimensionality of the feature space. In this paper, we propose a new measure for selecting features, which estimates term importance based on how far the probability distribution of each term deviates from the standard Poisson distribution. In the information retrieval literature, the deviation from Poisson has been used as a measure for weighting keywords, and this motivates us to adopt it as a measure for feature selection in text classification tasks. The proposed measure is constructed so as to have the same computational complexity as other standard measures used for feature selection. To test the effectiveness of our method, we conducted evaluation experiments on the Reuters-21578 corpus with support vector machine and k-NN classifiers. In the experiments, we performed binary classifications to determine whether each test document belongs to a certain target category or not. Each of the top 10 categories of Reuters-21578 was used as a target category, because these categories have sufficient numbers of training and test documents. Four measures were used for feature selection: information gain (IG), the chi-square (χ²) statistic, the Gini index, and the measure proposed in this work. Both the proposed measure and the Gini index proved to be better than IG and the χ² statistic in terms of macro-averaged and micro-averaged F1 values, especially at higher vocabulary reduction levels. (c) 2008 Elsevier Ltd. All rights reserved.
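The abstract does not spell out the paper's exact scoring formula, but the underlying idea can be sketched: fit a Poisson model to a term's occurrence counts and score the term by how much its observed document frequency departs from the Poisson prediction. The sketch below is an illustrative measure under that assumption, not a reproduction of the authors' formula; the function name `poisson_deviation_score` and the toy counts are invented for the example.

```python
import math

def poisson_deviation_score(term_counts, n_docs):
    """Illustrative deviation-from-Poisson feature score.

    term_counts: per-document occurrence counts for one term.
    Fits a single Poisson rate lam = total/n_docs, compares the
    document frequency that rate predicts, n_docs * (1 - e^-lam),
    with the observed document frequency. A large positive value
    means the term is "burstier" than Poisson (concentrated in a
    few documents), which is typical of content-bearing terms.
    """
    total = sum(term_counts)
    lam = total / n_docs  # maximum-likelihood Poisson rate per document
    df_observed = sum(1 for c in term_counts if c > 0)
    df_expected = n_docs * (1.0 - math.exp(-lam))
    return df_expected - df_observed

# Two terms with the same total count (15 in 10 documents):
bursty  = [5, 0, 0, 4, 0, 0, 0, 6, 0, 0]  # concentrated in 3 docs
uniform = [2, 1, 2, 1, 2, 1, 2, 1, 2, 1]  # spread over all 10 docs
print(poisson_deviation_score(bursty, 10))   # positive: deviates from Poisson
print(poisson_deviation_score(uniform, 10))  # negative: over-dispersed toward every doc
```

As with IG, χ², and the Gini index, this score needs only per-term count statistics, so it matches their computational complexity; terms would be ranked by the score and the top-k kept at each vocabulary reduction level.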
Pages: 6826-6832
Page count: 7