Spam Blog Filtering Based on Statistical Features

Cited by: 5

Authors:

Liu Wei (刘玮) [1]

Liao Xiangwen (廖祥文) [1]

Xu Hongbo (许洪波) [1]

Wang Lihong (王丽宏) [2]

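Affiliations:

[1] Research Center of Information Intelligence and Information Security, Institute of Computing Technology, Chinese Academy of Sciences

[2] National Computer Network and Information Security Management Center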

Source:

Journal of Chinese Information Processing (中文信息学报) | 2008 / Vol. 22 / No. 06

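Keywords:

computer application; Chinese information processing; content analysis; spam blog filtering; statistical features; word-frequency features; generalization ability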

DOI:

Not available

Chinese Library Classification number:

TP391.41

Subject classification code:

080203

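Abstract:

Based on the differences in statistical features between spam blogs and normal blogs, this paper analyzes several statistical features that are effective for blog classification and proposes a filtering method based on the statistical features of blog pages. Experiments on the Blog06 dataset show that the method achieves a filtering accuracy of 97%, about 7% higher than a filtering method based on word-frequency features. Its accuracy stays at around 95% across training sets of different sizes, indicating better generalization ability.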
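The abstract does not list the statistical features the authors used. As a rough illustration of what page-level statistical features can look like, the sketch below computes four hypothetical ones (word count, average word length, fraction of words inside links, and compression ratio). The feature set and the names `page_statistics` and `_TextCollector` are assumptions made for this example, not the paper's method.

```python
import re
import zlib
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects all text nodes and, separately, text nested inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.all_text = []
        self.anchor_text = []
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        self.all_text.append(data)
        if self._anchor_depth > 0:
            self.anchor_text.append(data)


def page_statistics(html):
    """Return hypothetical page-level statistical features for one blog page."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()

    words = re.findall(r"\w+", " ".join(parser.all_text))
    anchor_words = re.findall(r"\w+", " ".join(parser.anchor_text))
    raw = html.encode("utf-8")

    return {
        # Total number of word tokens on the page.
        "word_count": len(words),
        # Mean token length in characters.
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Share of tokens that appear inside link text.
        "anchor_word_fraction": len(anchor_words) / len(words) if words else 0.0,
        # Raw size divided by zlib-compressed size; repetitive pages score higher.
        "compression_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
    }
```

Feature dictionaries like this could be turned into numeric vectors and passed to any standard classifier. Which features and which classifier the paper actually evaluates is not stated in this record.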

Citation

Pages: 86-91

Page count: 6

7 references in total

[1]

The TREC Blog06 Collection: Creating and Analysing a Blog Test Collection. Macdonald C, Ounis I. DCS Technical Report TR-2006-224. 2006

[2]

Blog Track Open Task: Spam Blog Classification. Kolari P, Java A, Finin T, Mayfield J, Joshi A, Martineau J. TREC 2006 Blog Track Notebook.

[3]

Characterizing the splogosphere. Kolari P, Java A, Finin T. Proc. of the World Wide Web 2006 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics. 2006

[4]

SVMs for the blogosphere: Blog identification and splog detection. Kolari P, Finin T, Joshi A. Proc. of the AAAI Spring Symp. on Computational Approaches to Analyzing Weblogs. 2006

[5]

Splog Detection using self-similarity analysis on blog temporal dynamics. Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura, Belle L Tseng. Proc. of the ACM Workshop on Adversarial information retrieval on the web. 2007

[6]

Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach. Salvetti F, Nicolov N. Proc. of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers.

[7]

Detecting spam web pages through content analysis. Ntoulas A, Najork M, Manasse M, Fetterly D. Proc. of the 15th international conference on World Wide Web. 2006
