Offline/realtime traffic classification using semi-supervised learning

Cited by: 158
Authors
Erman, Jeffrey
Mahanti, Anirban [1]
Arlitt, Martin
Cohen, Ira
Williamson, Carey
Affiliations
[1] Indian Inst Technol, Dept Comp Sci & Engn, Delhi, India
[2] HP Labs, Enterprise Syst & Software Lab, Palo Alto, CA USA
[3] Univ Calgary, Dept Comp Sci, Calgary, AB T2N 1N4, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Internet traffic classification; realtime classification; machine learning; semi-supervised learning
DOI
10.1016/j.peva.2007.06.014
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Identifying and categorizing network traffic by application type is challenging because of the continued evolution of applications, especially of those with a desire to be undetectable. The diminished effectiveness of port-based identification and the overheads of deep packet inspection approaches motivate us to classify traffic by exploiting distinctive flow characteristics of applications when they communicate on a network. In this paper, we explore this latter approach and propose a semi-supervised classification method that can accommodate both known and unknown applications. To the best of our knowledge, this is the first work to use semi-supervised learning techniques for the traffic classification problem. Our approach allows classifiers to be designed from training data that consists of only a few labeled and many unlabeled flows. We consider pragmatic classification issues such as longevity of classifiers and the need for retraining of classifiers. Our performance evaluation using empirical Internet traffic traces that span a 6-month period shows that: (1) high flow and byte classification accuracy (i.e., greater than 90%) can be achieved using training data that consists of a small number of labeled and a large number of unlabeled flows; (2) presence of "mice" and "elephant" flows in the Internet complicates the design of classifiers, especially of those with high byte accuracy, and necessitates the use of weighted sampling techniques to obtain training flows; and (3) retraining of classifiers is necessary only when there are non-transient changes in the network usage characteristics. As a proof of concept, we implement prototype offline and realtime classification systems to demonstrate the feasibility of our approach. (c) 2007 Published by Elsevier B.V.
Pages: 1194-1213
Page count: 20