Exploiting unlabeled data to improve peer-to-peer traffic classification using incremental tri-training method

Cited by: 22
Authors
Raahemi, Bijan [1]
Zhong, Weicai [1]
Liu, Jing [2]
Affiliations
[1] Univ Ottawa, Telfer Sch Management, Ottawa, ON K1N 6N5, Canada
[2] Xidian Univ, Inst Intelligent Informat Proc, Xian 710071, Shaanxi, Peoples R China
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Stream data mining; Concept drift; Windowing technique; Tri-training; Unlabeled data; Peer-to-peer traffic; IP traffic identification;
DOI
10.1007/s12083-008-0022-6
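Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Subject classification code
080201 [Mechanical Manufacturing and Automation];
Abstract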
Unlabeled training examples are readily available in many applications, but labeled examples are fairly expensive to obtain. For instance, in our previous work on classifying peer-to-peer (P2P) Internet traffic, we observed that only about 25% of examples can be labeled as "P2P" or "NonP2P" using a port-based heuristic rule. We also expect that even fewer examples can be labeled in the future as more and more P2P applications use dynamic ports. This motivates us to investigate techniques that improve the accuracy of P2P traffic classification by exploiting unlabeled examples. In addition, Internet data flows dynamically and in large volumes (streaming data). In P2P applications, new communities of peers often join and old communities often leave, so classifiers must be able to update the model incrementally and to deal with concept drift. Based on these requirements, this paper proposes an incremental Tri-Training (iTT) algorithm. We tested our approach on a real data stream with 7.2 million labeled examples and 20.4 million unlabeled examples. The results show that the iTT algorithm can improve the accuracy of P2P traffic classification by exploiting unlabeled examples. In addition, it can effectively deal with the dynamic nature of streaming data and detect changes in communities of peers. We extracted attributes only from the IP layer, eliminating the privacy concern associated with techniques that use deep packet inspection.
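The abstract does not spell out the iTT algorithm, so the following is only a minimal sketch of the general tri-training idea it builds on: three classifiers are trained on bootstrap samples of the labeled data, and an unlabeled example is pseudo-labeled for one classifier when the other two agree on it. Everything here is an assumption made for illustration: the `NearestCentroid` base learner, the numeric feature vectors, the function names, and the per-window retraining in `incremental_tri_train`, which is a crude stand-in for a windowing technique rather than the paper's method. The error-rate checks of full tri-training are omitted.

```python
import random
from collections import Counter


class NearestCentroid:
    """Tiny stand-in base learner (assumption: the abstract names none)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, v in enumerate(xi):
                acc[j] += v
        # One mean vector per class label.
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_one(self, x):
        # Return the label whose centroid is closest in squared Euclidean distance.
        return min(
            self.centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, self.centroids[c])),
        )


def bootstrap(X, y, rng):
    """Sample len(X) labeled examples with replacement."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]


def tri_train(X_lab, y_lab, X_unl, rounds=2, seed=0):
    """Simplified tri-training: no error-rate checks, fixed number of rounds."""
    rng = random.Random(seed)
    models = [NearestCentroid().fit(*bootstrap(X_lab, y_lab, rng)) for _ in range(3)]
    for _ in range(rounds):
        updated = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            X_aug, y_aug = list(X_lab), list(y_lab)
            for x in X_unl:
                pj, pk = models[j].predict_one(x), models[k].predict_one(x)
                if pj == pk:  # the other two classifiers agree: pseudo-label x
                    X_aug.append(x)
                    y_aug.append(pj)
            updated.append(NearestCentroid().fit(X_aug, y_aug))
        models = updated
    return models


def predict(models, x):
    """Majority vote of the three classifiers."""
    return Counter(m.predict_one(x) for m in models).most_common(1)[0][0]


def incremental_tri_train(windows, rounds=2, seed=0):
    """Yield a fresh ensemble per stream window (assumed, crude windowing)."""
    for X_lab, y_lab, X_unl in windows:
        yield tri_train(X_lab, y_lab, X_unl, rounds=rounds, seed=seed)
```

Retraining from scratch on each window simply forgets older data, which is one blunt way to follow concept drift; an incremental learner would instead update its existing model as new windows arrive.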
Pages: 87-97
Page count: 11