A review of machine learning approaches to Spam filtering

Cited by: 307
Authors
Guzella, Thiago S. [1]
Caminhas, Walmir M. [1]
Affiliation
[1] Univ Fed Minas Gerais, Dept Elect Engn, BR-31270910 Belo Horizonte, MG, Brazil
Keywords
Spam filtering; Online learning; Bag-of-words (BoW); Naive Bayes; Image spam; ARTIFICIAL IMMUNE-SYSTEM; SUPPORT VECTOR MACHINES; FEATURE-SELECTION; CONCEPT DRIFT; CLASSIFICATION; EXTRACTION; GENERATION; KNOWLEDGE; MESSAGES; MODELS;
DOI
10.1016/j.eswa.2009.02.037
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. Two particularly important aspects not widely recognized in the literature are discussed: the difficulties in updating a classifier based on the bag-of-words representation and a major difference between two early naive Bayes models. Overall, we conclude that while important advancements have been made in recent years, several aspects remain to be explored, especially under more realistic evaluation settings. © 2009 Elsevier Ltd. All rights reserved.
Pages: 10206-10222
Page count: 17