Analytical evaluation of term weighting schemes for text categorization

Cited by: 43
Authors
Altincay, Hakan [1]
Erenel, Zafer [1]
Affiliation
[1] Eastern Mediterranean Univ, Dept Comp Engn, Famagusta, Northern Cyprus, Turkey
Keywords
Contour lines; Term occurrence probability; Term weighting; Relative weights; Text categorization; FEATURE-SELECTION; CLASSIFICATION
DOI
10.1016/j.patrec.2010.03.012
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
An analytical evaluation of six widely used term weighting techniques for text categorization is presented. The analysis expresses the term weights in terms of term occurrence probabilities in the positive and negative categories. The weighting behavior of each scheme is first clarified by analyzing the relation between the occurrence probabilities of terms that receive equal weights. The weights are then expressed in terms of the ratio and difference of term occurrence probabilities, which reveals the similarities and differences among the schemes. Simulations show that the relative performance of the schemes can be explained by the ways they use the ratio and difference of term occurrence probabilities in generating the term weights. (C) 2010 Elsevier B.V. All rights reserved.
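As a rough illustration of the ratio-versus-difference idea in the abstract, the sketch below computes three weights from a term's occurrence probabilities p (positive category) and q (negative category). The abstract does not name the six schemes, so the functions here are assumptions chosen for illustration: `ratio_weight` follows the general shape of a relevance-frequency-style weight, `difference_weight` is a hypothetical difference-only weight, and `odds_ratio` is the standard odds ratio. None of them is claimed to be the exact formulation analyzed in the paper.

```python
import math


def ratio_weight(p: float, q: float) -> float:
    """Illustrative weight that depends only on the ratio p / q (assumed form)."""
    return math.log2(2.0 + p / q)


def difference_weight(p: float, q: float) -> float:
    """Hypothetical weight that depends only on the difference p - q."""
    return abs(p - q)


def odds_ratio(p: float, q: float) -> float:
    """Standard odds ratio; it depends on both p and q, not on p / q alone."""
    return (p * (1.0 - q)) / (q * (1.0 - p))


# Two example terms with the same ratio p / q but different differences.
# The probabilities are made-up values, not data from the paper.
terms = {"frequent_term": (0.4, 0.1), "rare_term": (0.2, 0.05)}

for name, (p, q) in terms.items():
    print(
        f"{name}: ratio_weight={ratio_weight(p, q):.3f} "
        f"difference_weight={difference_weight(p, q):.3f} "
        f"odds_ratio={odds_ratio(p, q):.3f}"
    )
```

Under these assumptions both terms lie on the same contour line of the ratio-based weight (equal p / q gives equal weight), while the difference-based weight ranks the frequent term higher.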
Pages: 1310-1323
Number of pages: 14