Feature selection for text classification with Naive Bayes

Cited by: 339
Authors
Chen, Jingnian [1, 2]
Huang, Houkuan [1]
Tian, Shengfeng [1]
Qu, Youli [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing 100044, Peoples R China
[2] Shandong Univ Finance, Dept Informat & Comp Sci, Jinan 250014, Shandong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text classification; Feature selection; Text preprocessing; Naive Bayes; Nearest-neighbor
DOI
10.1016/j.eswa.2008.06.054
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
As an important preprocessing technique in text classification, feature selection can improve the scalability, efficiency and accuracy of a text classifier. In general, a good feature selection method should take both the domain and the characteristics of the learning algorithm into account. Because the Naive Bayesian classifier is simple, efficient, and highly sensitive to feature selection, research on feature selection designed specifically for it is worthwhile. This paper presents two feature evaluation metrics for the Naive Bayesian classifier applied to multi-class text datasets: Multi-class Odds Ratio (MOR) and Class Discriminating Measure (CDM). Text classification experiments with Naive Bayesian classifiers were carried out on two multi-class text collections. The results indicate that CDM and MOR select features noticeably better than the other feature selection approaches. © 2008 Elsevier Ltd. All rights reserved.
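The abstract names CDM but does not give its formula. As a rough illustration of how a per-term, multi-class discriminating score can be used to rank features, the sketch below assumes that a term's score is the sum over classes of |log(P(t|c) / P(t|not c))|, with probabilities estimated from Laplace-smoothed document frequencies. The function names `cdm_scores` and `select_top_k`, the smoothing constant `alpha`, and the estimation details are assumptions made here, not taken from the record.

```python
import math
from collections import Counter


def cdm_scores(docs, labels, alpha=1.0):
    """Score each term by summing |log(P(t|c) / P(t|not c))| over classes.

    docs   : list of token lists
    labels : list of class labels, same length as docs
    alpha  : Laplace smoothing constant (an assumption; avoids log(0))
    """
    classes = sorted(set(labels))
    vocab = sorted({t for tokens in docs for t in tokens})

    # Document frequency of each term within each class.
    df = {c: Counter() for c in classes}
    n_docs = Counter(labels)
    for tokens, c in zip(docs, labels):
        df[c].update(set(tokens))

    total = len(docs)
    scores = {}
    for t in vocab:
        total_df = sum(df[c][t] for c in classes)
        s = 0.0
        for c in classes:
            # Smoothed fraction of documents containing t, inside and outside class c.
            p_in = (df[c][t] + alpha) / (n_docs[c] + 2 * alpha)
            p_out = (total_df - df[c][t] + alpha) / (total - n_docs[c] + 2 * alpha)
            s += abs(math.log(p_in / p_out))
        scores[t] = s
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]
```

Under these assumptions, a term that appears equally often in every class scores zero, while a term concentrated in a single class scores high; the selected terms would then be the only features passed to the Naive Bayes classifier.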
Pages: 5432-5435
Number of pages: 4