Link Analysis for Web Spam Detection

Cited by: 69
Authors
Becchetti, Luca [2]
Castillo, Carlos [1]
Donato, Debora [1]
Baeza-Yates, Ricardo [1]
Leonardi, Stefano [2]
Affiliations
[1] Yahoo Res, Barcelona, Spain
[2] Univ Roma La Sapienza, Rome, Italy
Keywords
Link analysis; adversarial information retrieval
DOI
10.1145/1326561.1326563
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
[...] which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due to the large size of the Web, which means that, in practice, many algorithms are infeasible. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page, applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
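The probabilistic counting mentioned in the abstract can be illustrated with a minimal sketch that estimates, for every page, how many distinct pages reach it within a given number of link hops. This is an assumption-laden illustration, not the authors' implementation: it uses a Flajolet-Martin style bitmask estimator, which is not necessarily the estimator used in the paper, and the names `estimate_supporters`, `in_links`, `num_masks`, and `num_bits` are hypothetical.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant


def random_fm_bit(rng, num_bits):
    """Return a mask with a single bit set; bit i is chosen with probability 2^-(i+1)."""
    i = 0
    while i < num_bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i


def lowest_zero_bit(mask):
    """Index of the least-significant 0 bit in mask."""
    i = 0
    while mask & (1 << i):
        i += 1
    return i


def estimate_supporters(in_links, max_dist, num_masks=64, num_bits=32, seed=0):
    """Estimate the number of pages within d hops that lead to each page.

    in_links maps a page to the list of pages linking to it (assumed input
    format). Returns a dict: page -> [estimate for d = 0, 1, ..., max_dist].
    The count at distance d includes the page itself.
    """
    rng = random.Random(seed)
    nodes = sorted(set(in_links) | {u for srcs in in_links.values() for u in srcs})
    # Each page starts with num_masks independent one-bit sketches of itself.
    masks = {v: [random_fm_bit(rng, num_bits) for _ in range(num_masks)] for v in nodes}
    estimates = {v: [] for v in nodes}
    for _ in range(max_dist + 1):
        # Read off the current estimate from the averaged lowest-zero-bit index.
        for v in nodes:
            mean_r = sum(lowest_zero_bit(m) for m in masks[v]) / num_masks
            estimates[v].append(2 ** mean_r / PHI)
        # One propagation round: OR in the sketches of all in-neighbours.
        new_masks = {}
        for v in nodes:
            merged = list(masks[v])
            for u in in_links.get(v, []):
                for j in range(num_masks):
                    merged[j] |= masks[u][j]
            new_masks[v] = merged
        masks = new_masks
    return estimates
```

Calling `estimate_supporters({"b": ["a"], "c": ["b"]}, max_dist=3)` returns one list of four estimates per page. Each round touches every link once and stores only fixed-size bitmasks per page, which is what makes this style of counting feasible on large graphs; the estimates are coarse for small counts, so in this sketch they are best read as relative rather than exact values.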
Pages: 42