Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets

Cited by: 49
Authors
Bermejo, Pablo [1 ]
Gamez, Jose A. [1 ]
Puerta, Jose M. [1 ]
Affiliations
[1] Univ Castilla La Mancha, Comp Syst Dept I3A, Intelligent Syst & Data Min Grp, Albacete, Spain
Keywords
E-mail foldering; Text categorization; Imbalanced data; Naive Bayes multinomial; Classification;
DOI
10.1016/j.eswa.2010.07.146
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract
E-mail foldering, or e-mail classification into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state, and the fact that this is a dynamic problem, in the sense that e-mails arrive in our mail folders following a timeline. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not achieve good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem, and propose a new method based on learning and sampling probability distributions. Our experiments over a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial significantly improve when applying the balancing algorithm first. For the sake of completeness in our experimental study, we also compare this with another standard balancing method (SMOTE) and with other classifiers. © 2010 Elsevier Ltd. All rights reserved.
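The abstract does not specify the balancing algorithm, so the following is only a rough illustration of the general idea of "learning and sampling probability distributions" before training Naive Bayes Multinomial. It is not the authors' method. It assumes documents are lists of integer word ids, learns one Laplace-smoothed word distribution per folder, and samples synthetic documents for the smaller folders until every folder matches the largest one. The names `balance_by_sampling`, `train_nb`, `predict_nb`, and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def balance_by_sampling(docs, labels, vocab_size, alpha=1.0, seed=0):
    """Oversample smaller classes with documents drawn from a learned
    per-class word distribution (an illustrative stand-in, not the paper's
    algorithm)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, y in zip(docs, labels):
        by_class[y].append(doc)
    target = max(len(class_docs) for class_docs in by_class.values())

    out_docs, out_labels = list(docs), list(labels)
    for y, class_docs in by_class.items():
        # Learn a smoothed word distribution for this class.
        counts = Counter(w for d in class_docs for w in d)
        weights = [counts[w] + alpha for w in range(vocab_size)]
        lengths = [len(d) for d in class_docs]
        # Sample synthetic documents until the class reaches the target size.
        for _ in range(target - len(class_docs)):
            n_words = rng.choice(lengths)
            out_docs.append(rng.choices(range(vocab_size), weights=weights, k=n_words))
            out_labels.append(y)
    return out_docs, out_labels


def train_nb(docs, labels, vocab_size, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    model = {}
    for y, n_docs in class_counts.items():
        total = sum(word_counts[y].values()) + alpha * vocab_size
        log_prior = math.log(n_docs / len(labels))
        log_lik = [math.log((word_counts[y][w] + alpha) / total)
                   for w in range(vocab_size)]
        model[y] = (log_prior, log_lik)
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior for the document."""
    return max(model, key=lambda y: model[y][0] + sum(model[y][1][w] for w in doc))


# Toy data: four "inbox" e-mails and a single "travel" e-mail.
docs = [[0, 0, 1], [0, 1, 1], [0, 1], [0, 0], [2, 3, 3]]
labels = ["inbox"] * 4 + ["travel"]

bal_docs, bal_labels = balance_by_sampling(docs, labels, vocab_size=4)
model = train_nb(bal_docs, bal_labels, vocab_size=4)
prediction = predict_nb(model, [3, 3, 2])
```

After balancing, both folders contribute the same number of training documents, so the class prior no longer favours the larger folder. Whether this helps on real e-mail data depends on the corpus and on the actual balancing algorithm, which this sketch does not reproduce.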
Pages: 2072-2080
Number of pages: 9