AN EFFICIENT APPROACH TO COMMENT SPAM IDENTIFICATION

Cited by: 1
Authors
Yang Yuhang, Zhao Tiejun, Zheng Dequan, Yu Hao
Institution
MOEMS Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, Harbin 150001, China
Keywords
Comment spam; Automatic identification; Content analysis; Blog
DOI
Not available
Chinese Library Classification (CLC) number
TP391.3 [Retrieval machines]
Discipline classification code
081203; 0835
Abstract
This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. Content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on Chinese and English comments, precision reached 93% and 82%, respectively. The results demonstrate the validity and language independence of the approach. Unlike conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. It can handle new comments as well as existing ones.
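The three features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the URL pattern, the regex tokenizer, and the raw term-frequency weighting are all assumptions made for illustration, and the abstract does not state which texts are compared or which thresholds are applied, so none are set here.

```python
import math
import re
from collections import Counter

# Assumed URL pattern; the paper's own link detection is not described.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment: str) -> int:
    """Feature 1: number of links found in a comment."""
    return len(LINK_RE.findall(comment))


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (character-level DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def repetitiveness(comment: str, others: list[str]) -> tuple[int, int]:
    """Feature 2 (assumed reading): length of the longest substring the
    comment shares with any other comment, and how many of the other
    comments contain that substring."""
    best = ""
    for other in others:
        lcs = longest_common_substring(comment, other)
        if len(lcs) > len(best):
            best = lcs
    freq = sum(1 for other in others if best and best in other)
    return len(best), freq


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer; Chinese text would need word segmentation instead.
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Feature 3: vector space model similarity with raw term frequencies."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0
```

A comment could then be scored by calling `count_links`, `repetitiveness`, and `cosine_similarity` on it, for example against the other comments on a page and the text of the post. How the three values are combined into a spam decision is left open, since the abstract does not say.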
Pages: 644-650
Number of pages: 7