Discretization for naive-Bayes learning: managing discretization bias and variance

Citations: 145
Authors
Yang, Ying [1]
Webb, Geoffrey I. [2]
Affiliations
[1] Australian Taxation Office, Box Hill, Vic 3128, Australia
[2] Monash University, Faculty of Information Technology, Clayton, Vic 3800, Australia
Keywords
Discretization; Naive-Bayes learning; Bias; Variance; Classifiers; Assumption
DOI
10.1007/s10994-008-5083-5
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Quantitative attributes are usually discretized in naive-Bayes learning. We establish simple conditions under which discretization is equivalent to using the true probability density function during naive-Bayes learning. Different discretization techniques can be expected to affect the classification bias and variance of the resulting naive-Bayes classifiers; we call these effects discretization bias and variance. We argue that properly managing discretization bias and variance can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose proportional discretization and fixed frequency discretization, two efficient unsupervised discretization methods that effectively manage discretization bias and variance. We evaluate the new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses: with statistically significant frequency, naive-Bayes classifiers trained on data discretized by the new methods achieve lower classification error than those trained on data discretized by established discretization methods.
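The two proposed methods both control how many training instances fall in each interval, so they can be illustrated as variants of equal-frequency binning. The sketch below is only an illustration under stated assumptions, since the abstract does not give the exact rules: it assumes proportional discretization sets both the interval size and the number of intervals to roughly the square root of the number of training values, and that fixed frequency discretization uses a constant interval size (the default of 30 here is an assumed value). All function names are hypothetical.

```python
import bisect
import math


def equal_frequency_cut_points(values, interval_size):
    """Return cut points so that each interval holds about
    `interval_size` of the sorted training values."""
    ordered = sorted(values)
    cuts = []
    for i in range(interval_size, len(ordered), interval_size):
        left, right = ordered[i - 1], ordered[i]
        # Skip a boundary that would split identical values.
        if left != right:
            cuts.append((left + right) / 2)
    return cuts


def proportional_cut_points(values):
    """Assumed rule: interval size grows as sqrt(n), so the number of
    intervals also grows as sqrt(n) when more training data arrives."""
    size = max(1, math.isqrt(len(values)))
    return equal_frequency_cut_points(values, size)


def fixed_frequency_cut_points(values, m=30):
    """Assumed rule: every interval holds about m training values, so
    only the number of intervals grows with more training data."""
    return equal_frequency_cut_points(values, m)


def interval_index(value, cuts):
    """Map a quantitative value to the index of its interval."""
    return bisect.bisect_right(cuts, value)


training_values = list(range(100))
pd_cuts = proportional_cut_points(training_values)
ffd_cuts = fixed_frequency_cut_points(training_values, m=30)
```

With 100 distinct training values, the proportional variant yields ten intervals of ten values each, while the fixed-frequency variant with m = 30 yields three full intervals plus a smaller remainder interval. A naive-Bayes learner would then estimate class-conditional probabilities from counts per interval index.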
Pages: 39-74
Page count: 36