A review of machine learning approaches to Spam filtering

Cited by: 307
Authors
Guzella, Thiago S. [1]
Caminhas, Walmir M. [1]
Affiliations
[1] Univ Fed Minas Gerais, Dept Elect Engn, BR-31270910 Belo Horizonte, MG, Brazil
Keywords
Spam filtering; Online learning; Bag-of-words (BoW); Naive Bayes; Image spam; Artificial immune system; Support vector machines; Feature selection; Concept drift; Classification; Extraction; Generation; Knowledge; Messages; Models
DOI
10.1016/j.eswa.2009.02.037
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. Two particularly important aspects not widely recognized in the literature are discussed: the difficulties in updating a classifier based on the bag-of-words representation and a major difference between two early naive Bayes models. Overall, we conclude that while important advancements have been made in recent years, several aspects remain to be explored, especially under more realistic evaluation settings. (C) 2009 Elsevier Ltd. All rights reserved.
Pages: 10206-10222
Page count: 17