Multiscale Matrix Sampling and Sublinear-Time PageRank Computation

被引:19
作者
Borgs, Christian [1 ]
Brautbar, Michael [2 ]
Chayes, Jennifer [1 ]
Teng, Shang-Hua [3 ]
机构
[1] Microsoft Res New England, 1 Mem Dr, Cambridge, MA 02142 USA
[2] Univ Penn, Comp & Informat Sci Dept, Philadelphia, PA 19104 USA
[3] Univ Southern Calif, Comp Sci Dept, Los Angeles, CA 90089 USA
基金
美国国家科学基金会;
关键词
D O I
10.1080/15427951.2013.802752
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
A fundamental problem arising in many applications in Web science and social network analysis is the problem of identifying all nodes in a network whose PageRank exceeds a given threshold Delta. In this paper, we study the probabilistic version of the problem whereby given an arbitrary approximation factor c > 1, we are asked to output a set S of nodes such that with high probability, S contains all nodes of PageRank at least Delta, and no node of PageRank smaller than Delta/c. We call this problem SIGNIFICANTPAGERANKS. We develop a nearly optimal local algorithm for the problem with time complexity (O) over tilde (n/Delta) on networks with n nodes, where the tilde hides a polylogarithmic factor. We show that every algorithm for solving this problem must have running time of Omega(n/Delta), rendering our algorithm optimal up to logarithmic factors. Our algorithm has sublinear time complexity for applications including Web crawling and Web search that require efficient identification of nodes whose PageRanks are above a threshold Delta = n(delta), for some constant 0 < delta < 1. Our algorithm comes with two main technical contributions. The first is a multiscale sampling scheme for a basic matrix problem that could be of interest on its own. For us, it appears as an abstraction of a subproblem we need to tackle in order to solve the SIGNIFICANTPAGERANKS problem, but we hope that this abstraction will be useful in designing fast algorithms for identifying nodes that are significant beyond PageRank measurements. In the abstract matrix problem, it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter epsilon. At a cost propositional to 1/epsilon, the query will return a list of O(1/epsilon) entries and their indices that provide an epsilon-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least Delta and omits every column whose sum is less than Delta/c. Our multiscale sampling scheme solves this problem with cost (O) over tilde (n/Delta), while traditional sampling algorithms would take time Theta((n/Delta)(2)). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [Jeh and Widom 03, Andersen et al. 06] and is highly efficient, particularly for networks with large in-degrees or out-degrees. Together with our multiscale sampling scheme, we are able to solve the SIGNIFICANT-PAGERANKS problem optimally.
引用
收藏
页码:20 / 48
页数:29
相关论文
共 18 条
[1]  
Andersen R, 2006, ANN IEEE SYMP FOUND, P475
[2]   Local Computation of PageRank Contributions [J].
Andersen, Reid ;
Borgs, Christian ;
Chayes, Jennifer ;
Hopcroft, John ;
Mirrokni, Vahab ;
Teng, Shang-Hua .
INTERNET MATHEMATICS, 2008, 5 (1-2) :23-45
[3]   Monte Carlo methods in pagerank computation: When one iteration is sufficient [J].
Avrachenkov, K. ;
Litvak, N. ;
Nemirovsky, D. ;
Osipova, N. .
SIAM JOURNAL ON NUMERICAL ANALYSIS, 2007, 45 (02) :890-904
[4]  
Bahmani B, 2010, PROC VLDB ENDOW, V4, P173
[5]   A Survey on PageRank Computing [J].
Berkhin, Pavel .
INTERNET MATHEMATICS, 2005, 2 (01) :73-120
[6]   The anatomy of a large-scale hypertextual Web search engine [J].
Brin, S ;
Page, L .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) :107-117
[7]  
Chi-Chih Yao A., 1977, 18th Annual Symposium on Foundations of Computer Science, P222
[8]   An improved data stream summary: the count-min sketch and its applications [J].
Cormode, G ;
Muthukrishnan, S .
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 2005, 55 (01) :58-75
[9]   SPARSE MATRICES IN MATLAB - DESIGN AND IMPLEMENTATION [J].
GILBERT, JR ;
MOLER, C ;
SCHREIBER, R .
SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS, 1992, 13 (01) :333-356
[10]  
Goldreich O, 2010, LECT NOTES COMPUT SC, V6390, P105, DOI 10.1007/978-3-642-16367-8_7