Combating imbalance in network intrusion datasets

Cited by: 218

Authors:

Cieslak, David A. [1]

Chawla, Nitesh V. [1]

Striegel, Aaron [1]

Affiliations:

[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 USA

Source:

2006 IEEE International Conference on Granular Computing | 2006

Keywords:

computer network security; imbalanced datasets; classification; ROC curves

DOI:

10.1109/GRC.2006.1635905

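Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 [Pattern recognition and intelligent systems]; 0812 [Computer science and technology]; 0835 [Software engineering]; 1405 [Intelligence science and technology]

Abstract: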

One approach to combating network intrusion is to develop systems that apply machine learning and data mining techniques. Many intrusion detection systems (IDS) suffer from a high rate of false alarms and missed intrusions. We want to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule learning, using RIPPER, on highly imbalanced intrusion datasets, with the objective of improving the true positive rate (intrusions) without significantly increasing the false positives. We use RIPPER as the underlying rule classifier. To counter imbalance in the data, we implement a combination of oversampling (both by replication and by synthetic generation) and undersampling techniques. We also propose a clustering-based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets, one destination-based and one actual-packets-based, constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of the minority (intrusion) class outperforms both oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering-based approach is more suitable for detecting intrusions and provides additional improvement over synthetic generation of instances alone.
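The three resampling ideas named in the abstract (oversampling by replication, oversampling by synthetic generation, and undersampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the choice of k = 5 are assumptions made here for illustration, and the synthetic step follows the general SMOTE-style idea of interpolating between a minority instance and one of its nearest minority neighbours. The paper's clustering-based variant and the RIPPER training step are not reproduced, since the abstract does not specify them.

```python
import math
import random


def oversample_by_replication(minority, n_new, rng):
    """Duplicate randomly chosen minority rows (sampling with replacement)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def oversample_synthetic(minority, n_new, k, rng):
    """SMOTE-style generation: interpolate between a minority row and one of
    its k nearest minority neighbours (Euclidean distance, numeric features)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base row, excluding the row itself.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def undersample(majority, n_keep, rng):
    """Keep a random subset of majority rows, without replacement."""
    return rng.sample(majority, n_keep)


rng = random.Random(0)
# Toy data (hypothetical): 20 minority "intrusion" rows, 2000 majority rows.
attacks = [[rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(20)]
normal = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]

replicated = oversample_by_replication(attacks, 80, rng)
synthetic = oversample_synthetic(attacks, 80, k=5, rng=rng)
reduced = undersample(normal, 500, rng)
print(len(replicated), len(synthetic), len(reduced))
```

Replication only repeats existing intrusion rows, whereas the synthetic step produces new points on the segments between neighbouring intrusion rows, which is the distinction the abstract's comparison rests on.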

Citation

Pages: 732 / +

Page count: 2
