A comparative study on text representation schemes in text categorization

Cited by: 3
Authors
Fengxi Song
Shuhai Liu
Jingyu Yang
Affiliation
[1] Nanjing University of Science and Technology, Department of Computer Science
Source
Pattern Analysis and Applications | 2005 / Vol. 8
Keywords
Text categorization; Text representation; Support vector machines; Multi-way analysis of variance; Pattern recognition;
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
It is well known that the classification effectiveness of a text categorization system is not simply a matter of learning algorithms; text representation factors are also at work. This paper considers how the effectiveness of text classifiers is linked to five text representation factors: “stop words removal”, “word stemming”, “indexing”, “weighting”, and “normalization”. Statistical analyses of the experimental results show that performing “normalization” always improves the effectiveness of text classifiers significantly. The effects of the other factors are not as great as expected. Contrary to common sense, a simple binary indexing method can sometimes be helpful for text categorization.
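The five representation factors named in the abstract can be pictured as independent switches in a small preprocessing pipeline. The following is a minimal sketch only, not the authors' experimental setup: the stop list, the `crude_stem` suffix stripper, the `represent` function, and the choice of log inverse document frequency weighting with L2 length normalization are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical, tiny stop list for illustration only.
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def represent(docs, remove_stop_words=True, stem=True,
              indexing="tf", weighting="idf", normalize=True):
    """Map raw strings to sparse vectors (dicts), one switch per factor."""
    tokenized = []
    for doc in docs:
        tokens = doc.lower().split()
        if remove_stop_words:          # factor 1: stop words removal
            tokens = [t for t in tokens if t not in STOP_WORDS]
        if stem:                       # factor 2: word stemming
            tokens = [crude_stem(t) for t in tokens]
        tokenized.append(tokens)

    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if indexing == "binary":       # factor 3: indexing (presence only)
            vec = {t: 1.0 for t in counts}
        else:                          # raw term frequency
            vec = {t: float(c) for t, c in counts.items()}
        if weighting == "idf":         # factor 4: weighting (assumed log IDF)
            vec = {t: v * math.log(n_docs / doc_freq[t])
                   for t, v in vec.items()}
        if normalize:                  # factor 5: normalization (assumed L2)
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                vec = {t: v / norm for t, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Calling `represent(docs, indexing="binary", weighting=None, normalize=False)` gives the plain presence/absence representation, while the defaults give a length-normalized, IDF-weighted term-frequency vector. Toggling one argument at a time mirrors the kind of factor-by-factor comparison the abstract describes.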
Pages: 199-209
Number of pages: 10