Twitter spammer detection using data stream clustering

Cited by: 219
Authors
Miller, Zachary [1]
Dickinson, Brian [1]
Deitrick, William [1]
Hu, Wei [1]
Wang, Alex Hai [2]
Affiliations
[1] Houghton Coll, Dept Comp Sci, Houghton, NY 14744 USA
[2] Penn State Univ, Coll Informat Sci & Technol, Dunmore, PA USA
Keywords
Twitter; Spam detection; Clustering; Data stream
DOI
10.1016/j.ins.2013.11.016
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The rapid growth of Twitter has triggered a dramatic increase in spam volume and sophistication. The abuse of certain Twitter components such as "hashtags", "mentions", and shortened URLs enables spammers to operate efficiently. These same features, however, may be a key factor in identifying new spam accounts as shown in previous studies. Our study provides three novel contributions. Firstly, previous studies have approached spam detection as a classification problem, whereas we view it as an anomaly detection problem. Secondly, 95 one-gram features from tweet text were introduced alongside the user information analyzed in previous studies. Finally, to effectively handle the streaming nature of tweets, two stream clustering algorithms, StreamKM++ and DenStream, were modified to facilitate spam identification. Both algorithms clustered normal Twitter users, treating outliers as spammers. Each of these algorithms performed well individually, with StreamKM++ achieving 99% recall and a 6.4% false positive rate; and DenStream producing 99% recall and a 2.8% false positive rate. When used in conjunction, these algorithms reached 100% recall and a 2.2% false positive rate, meaning that our system was able to identify 100% of the spammers in our test while incorrectly detecting only 2.2% of normal users as spammers. (C) 2013 Elsevier Inc. All rights reserved.
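The two metrics quoted in the abstract, recall and false positive rate, can be illustrated with a minimal sketch. The function name `recall_and_fpr` and the toy labels below are assumptions made for illustration only; they are not the authors' code or data, and the sketch does not reproduce StreamKM++ or DenStream.

```python
def recall_and_fpr(y_true, y_pred):
    """Return (recall, false_positive_rate) for binary labels.

    Convention assumed here: 1 = spammer, 0 = normal user.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spammers flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spammers missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal users flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal users passed

    # Recall: share of actual spammers that were identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # False positive rate: share of normal users wrongly flagged as spammers.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


# Toy example (made-up labels): 2 spammers, 4 normal users.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]  # both spammers caught, one normal user flagged
recall, fpr = recall_and_fpr(y_true, y_pred)
print(f"recall={recall:.2f}, false positive rate={fpr:.2f}")
```

Under this reading, the reported combined result corresponds to a recall of 1.0 and a false positive rate of 0.022 on the authors' test set.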
Pages: 64-73
Page count: 10