Some effective techniques for naive Bayes text classification

Cited by: 312

Authors:

Kim, Sang-Bum

Han, Kyoung-Soo

Rim, Hae-Chang

Myaeng, Sung Hyon

Affiliations:

[1] Korea Univ, Coll Informat & Commun, Dept Comp Sci & Engn, Seoul 136701, South Korea

[2] Informat & Commun Univ, Taejon 305732, South Korea

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2006 / Vol. 18 / No. 11

Funding:

Japan Society for the Promotion of Science;

Keywords:

text classification; naive Bayes classifier; Poisson model; feature weighting;

DOI:

10.1109/TKDE.2006.180

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

While naive Bayes is quite effective in various data mining tasks, it shows disappointing results in automatic text classification. Based on our observation of naive Bayes applied to natural language text, we found a serious problem in the parameter estimation process, which causes poor results in the text classification domain. In this paper, we propose two empirical heuristics: per-document text normalization and a feature weighting method. While these are somewhat ad hoc methods, our proposed naive Bayes text classifier performs very well on the standard benchmark collections, competing with state-of-the-art text classifiers based on highly complex learning methods such as SVM.

Pages: 1457-1466

Page count: 10
