On the Optimality of the Simple Bayesian Classifier under Zero-One Loss

Cited by: 117
Authors
Pedro Domingos
Michael Pazzani
Affiliations
[1] University of California, Department of Information and Computer Science
Source
Machine Learning | 1997 / Vol. 29
Keywords
Simple Bayesian classifier; naive Bayesian classifier; zero-one loss; optimal classification; induction with attribute dependences;
DOI
Not available
Abstract
The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive. This article shows that, although the Bayesian classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article's results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
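As a minimal illustration of the abstract's claim about conjunctions (a sketch constructed here, not code from the article): the Python snippet below trains a naive Bayes classifier on the full truth table of the conjunction a1 AND a2 AND a3. The attributes are clearly dependent given the class, yet the classifier attains zero misclassifications, i.e. it is optimal under zero-one loss despite the violated independence assumption. All names (data, cond, nb_predict) are illustrative.

from itertools import product

# Full truth table of 3 Boolean attributes; label is 1 iff all are 1.
data = [(x, int(all(x))) for x in product([0, 1], repeat=3)]

# Maximum-likelihood estimates of the class priors and the
# per-attribute conditionals P(a_i = v | class).
classes = [0, 1]
prior = {c: sum(y == c for _, y in data) / len(data) for c in classes}
cond = {}  # cond[(i, v, c)] = P(a_i = v | class = c)
for c in classes:
    n_c = sum(y == c for _, y in data)
    for i in range(3):
        for v in (0, 1):
            n_iv = sum(x[i] == v and y == c for x, y in data)
            cond[(i, v, c)] = n_iv / n_c

def nb_predict(x):
    # Pick the class maximizing P(c) * prod_i P(a_i = x_i | c),
    # i.e. the naive Bayes score under the independence assumption.
    scores = {c: prior[c] * cond[(0, x[0], c)]
                          * cond[(1, x[1], c)]
                          * cond[(2, x[2], c)]
              for c in classes}
    return max(scores, key=scores.get)

# Zero-one loss over the whole instance space: prints 0 errors out of 8.
errors = sum(nb_predict(x) != y for x, y in data)
print(f"misclassifications: {errors} / {len(data)}")

The sketch uses unsmoothed maximum-likelihood estimates: the zero conditionals P(a_i = 0 | positive) send every example with a 0-valued attribute to the negative class, which is exactly the conjunction's behavior, so the probability estimates can be poor while the zero-one-loss decisions remain optimal.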
Pages: 103-130
Page count: 27