A memory-based approach to anti-spam filtering for mailing lists

Cited by: 144

Authors:

Sakkis, G [1]

Androutsopoulos, I

Paliouras, G

Karkaletsis, V

Spyropoulos, CD

Stamatopoulos, P

Affiliations:

[1] Natl Ctr Sci Res Demokritos, Inst Informat & Telecommun, GR-15310 Athens, Greece

[2] Athens Univ Econ & Business, Dept Informat, GR-10434 Athens, Greece

[3] Natl Ctr Sci Res Demokritos, Inst Informat & Telecommun, GR-15310 Athens, Greece

[4] Univ Athens, Dept Informat, GR-15771 Athens, Greece

Source:

INFORMATION RETRIEVAL | 2003 / Vol. 6 / Issue 01

Keywords:

text categorization; machine learning; unsolicited commercial e-mail; spam

DOI:

10.1023/A:1022948414856

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to automatically identify the unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation covers different attribute-weighting and distance-weighting schemes, and studies the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs better on average, particularly when the misclassification cost for non-spam messages is high.
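
The abstract names the main ingredients of the filter (neighborhood size, attribute weighting, distance weighting, and cost-sensitive decisions) without showing how they fit together. The following is a minimal Python sketch of one way they could be combined, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the four-word vocabulary, the toy training messages, the uniform attribute weights, the inverse-distance vote `1 / (1 + d)`, and the decision rule that flags a message as spam only when its weighted spam share exceeds `cost_ratio / (1 + cost_ratio)`.

```python
# Illustrative sketch only: a tiny memory-based (k-nearest-neighbour) spam filter.
# All data, weights, and parameter values below are hypothetical.

def overlap_distance(x, y, weights):
    """Attribute-weighted overlap distance between two binary vectors."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def spam_score(query, training, weights, k):
    """Distance-weighted share of spam votes among the k nearest neighbours."""
    neighbours = sorted(
        (overlap_distance(query, vec, weights), label) for vec, label in training
    )[:k]
    votes = {"spam": 0.0, "legit": 0.0}
    for dist, label in neighbours:
        votes[label] += 1.0 / (1.0 + dist)  # closer neighbours count for more
    return votes["spam"] / (votes["spam"] + votes["legit"])


def classify(query, training, weights, k, cost_ratio):
    """Flag as spam only if the spam share clears a cost-dependent threshold.

    cost_ratio (assumed): how many times worse it is to block a legitimate
    message than to let a spam message through.
    """
    threshold = cost_ratio / (1.0 + cost_ratio)
    score = spam_score(query, training, weights, k)
    return "spam" if score > threshold else "legit"


# Hypothetical attributes: presence of "free", "money", "meeting", "agenda".
TRAINING = [
    ([1, 1, 0, 0], "spam"),
    ([1, 0, 0, 0], "spam"),
    ([0, 1, 0, 0], "spam"),
    ([0, 0, 1, 1], "legit"),
    ([0, 0, 1, 0], "legit"),
    ([0, 0, 0, 1], "legit"),
]
WEIGHTS = [1.0, 1.0, 1.0, 1.0]  # uniform attribute weights (assumption)
QUERY = [1, 1, 0, 1]            # contains "free", "money", "agenda"

low_cost = classify(QUERY, TRAINING, WEIGHTS, k=4, cost_ratio=1.0)
high_cost = classify(QUERY, TRAINING, WEIGHTS, k=4, cost_ratio=9.0)
print(low_cost, high_cost)
```

In this toy setup the same query is treated differently under the two cost ratios: raising the penalty for blocking legitimate mail raises the threshold, so a message with mixed evidence is let through. This mirrors the abstract's point that the cost of misclassifying non-spam messages drives how the filter should be evaluated and tuned.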


Pages: 49-73

Number of pages: 25
