Using the absolute difference of term occurrence probabilities in binary text categorization

Cited by: 5
Authors
Altincay, Hakan [1]
Erenel, Zafer [1]
Affiliations
[1] Eastern Mediterranean Univ, Dept Comp Engn, Gazimagusa, Cyprus
Keywords
Term occurrence probability; Term weighting; Relevance frequency; Mutual information; Chi-square; Odds ratio; Text categorization; Feature selection; Classification
DOI
10.1007/s10489-010-0250-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study examines the differences among widely used term weighting schemes by ordering terms according to their discriminative abilities, using a recently developed framework that expresses term weights in terms of the ratio and absolute difference of term occurrence probabilities. The ordering of terms is observed to depend on the weighting scheme under consideration, which can be explained by the way different schemes use term occurrence differences in generating term weights. It is then proposed that relevance frequency, which has been shown to provide the best scores on several datasets, can be improved by taking into account the way absolute difference values are used in other widely used schemes. Experimental results on two different datasets show that improved F1 scores can be achieved.
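As a rough illustration of the quantities named in the abstract, the sketch below computes term occurrence probabilities, their ratio and absolute difference, and relevance frequency for two made-up terms. It is not the scheme proposed in the paper. The relevance frequency formula used, log2(2 + a / max(1, c)), is the definition commonly given in the term weighting literature, not one stated in this record. The names `a`, `c`, `n_pos`, `n_neg` and all counts are hypothetical.

```python
import math


def describe_term(a, c, n_pos, n_neg):
    """Summarize one term in a binary categorization task.

    a     : positive-class documents containing the term (hypothetical name)
    c     : negative-class documents containing the term (hypothetical name)
    n_pos : total number of positive-class documents
    n_neg : total number of negative-class documents
    """
    p_pos = a / n_pos  # term occurrence probability in the positive class
    p_neg = c / n_neg  # term occurrence probability in the negative class
    return {
        "p_pos": p_pos,
        "p_neg": p_neg,
        "abs_diff": abs(p_pos - p_neg),
        "ratio": p_pos / p_neg if p_neg > 0 else math.inf,
        # Relevance frequency as commonly defined in the literature
        # (assumed here, not taken from this record).
        "rf": math.log2(2 + a / max(1, c)),
    }


# Two made-up terms in a corpus of 100 positive and 400 negative documents.
# Both have a / c = 2, so they share the same ratio and the same rf value,
# but the frequent term separates the classes by a much larger margin.
frequent = describe_term(a=80, c=40, n_pos=100, n_neg=400)
rare = describe_term(a=8, c=4, n_pos=100, n_neg=400)

for name, stats in (("frequent", frequent), ("rare", rare)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

In this toy case the two terms receive identical relevance frequency values even though their absolute differences differ by a factor of ten, which is the kind of gap the abstract suggests absolute difference information could help close.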
Pages: 148-160
Number of pages: 13