Some effective techniques for naive Bayes text classification

Cited by: 312
Authors
Kim, Sang-Bum
Han, Kyoung-Soo
Rim, Hae-Chang
Myaeng, Sung Hyon
Affiliations
[1] Korea Univ, Coll Informat & Commun, Dept Comp Sci & Engn, Seoul 136701, South Korea
[2] Informat & Commun Univ, Taejon 305732, South Korea
Funding
Japan Society for the Promotion of Science
Keywords
text classification; naive Bayes classifier; Poisson model; feature weighting
DOI
10.1109/TKDE.2006.180
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
While naive Bayes is quite effective in various data mining tasks, it shows disappointing results in automatic text classification. Based on our observation of naive Bayes applied to natural language text, we found a serious problem in the parameter estimation process, which causes poor results in the text classification domain. In this paper, we propose two empirical heuristics: per-document text normalization and a feature weighting method. While these are somewhat ad hoc methods, our proposed naive Bayes text classifier performs very well on the standard benchmark collections, competing with state-of-the-art text classifiers based on highly complex learning methods such as SVM.
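The abstract does not spell out either heuristic, so the following is only an illustrative sketch of one plausible reading of per-document normalization, not the paper's formulation. It assumes a multinomial naive Bayes classifier in which each training document contributes its relative term frequencies (count divided by document length) rather than raw counts, so that long documents do not dominate the class parameter estimates. The names `train` and `classify` and the smoothing parameter `alpha` are hypothetical, and the Poisson model and feature weighting named in the keywords are not covered.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on length-normalized documents.

    docs   -- list of token lists
    labels -- list of class names, one per document
    alpha  -- additive smoothing constant (an assumption of this sketch)
    """
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)

    # For each class, accumulate per-document relative frequencies.
    # Every non-empty document adds a total mass of 1, whatever its length.
    weight = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if not doc:
            continue
        for tok, count in Counter(doc).items():
            weight[label][tok] += count / len(doc)

    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(weight[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            tok: math.log((weight[label][tok] + alpha) / denom) for tok in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def classify(model, doc):
    """Return the class with the highest log posterior score for doc."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # Tokens never seen in training are ignored.
        scores[label] = log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    docs = [
        ["ball", "goal", "ball"],
        ["goal", "match"],
        ["vote", "election"],
        ["vote", "party", "vote", "vote", "vote", "vote"],
    ]
    labels = ["sport", "sport", "politics", "politics"]
    model = train(docs, labels)
    print(classify(model, ["goal", "ball"]))
```

In this sketch the six-token politics document and the two-token one carry equal weight in the class estimates, which is the effect the normalization is meant to illustrate.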
Pages: 1457-1466
Number of pages: 10