Web Spam Detection Based on SMOTE and Random Forests

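Cited by: 11
Authors
Fang Xiaonan [1, 2]
Zhang Huaxiang [1, 2]
Gao Shuang [1, 2]
Affiliations
[1] School of Information Science and Engineering, Shandong Normal University
[2] Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology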
Keywords
ensemble learning; Web spam; random forests; SMOTE; search engine spamming
DOI
Not available
Chinese Library Classification number
TP391.3 [Retrieval machines]
Discipline classification code
081203; 0835
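Abstract
Web spam refers to the use of techniques that make a web page rank higher in search engine results than it deserves, which seriously degrades the quality of search results. Because Web spam data sets are severely imbalanced, this study proposes first balancing the data set with the SMOTE oversampling method and then training a classifier with the random forest algorithm. Comparative experiments against common single classifiers and ensemble learning classifiers show that the SMOTE+RF method performs notably well. Important parameters of the method are tuned according to the experimental results, and the reasons why the AUC improves after applying SMOTE are analyzed. Experiments on the WEBSPAM UK2007 data set show that the method significantly improves classification performance, with an AUC exceeding the best result of the Web Spam Challenge 2008.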
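The oversampling step named in the abstract can be illustrated with a small, standard-library-only sketch of the basic SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation. The function names `smote` and `balance`, the neighbour count `k`, the seed, and the two-feature toy data are all assumptions made for illustration; the abstract does not state the parameter values the paper settled on.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes are the same size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    new_rows = smote(minority, max(n_new, 0), k=k, seed=seed)
    return X + new_rows, y + [minority_label] * len(new_rows)


# Toy data (hypothetical): 8 normal hosts and 3 spam hosts, two features each.
X = [[0.1 * i, 0.2 * i] for i in range(8)] + [[5.0, 5.0], [5.5, 5.2], [6.0, 5.1]]
y = [0] * 8 + [1] * 3
X_bal, y_bal = balance(X, y, k=2)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

In the pipeline the abstract describes, the balanced training set would then be passed to a random forest learner, for example scikit-learn's `RandomForestClassifier`, and scored by AUC on held-out data that has not been oversampled. A maintained SMOTE implementation is also available in the imbalanced-learn package.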
Pages: 22-27+33
Page count: 7