Exact Recovery in the Stochastic Block Model

被引:304
作者
Abbe, Emmanuel [1 ]
Bandeira, Afonso S. [2 ,3 ]
Hall, Georgina [4 ]
机构
[1] Princeton Univ, Dept Elect Engn, Program Appl & Computat Math, Princeton, NJ 08544 USA
[2] Princeton Univ, Program Appl & Computat Math, Princeton, NJ 08544 USA
[3] MIT, Dept Math, Cambridge, MA 02142 USA
[4] Princeton Univ, Dept Operat Res & Financial Engn, Princeton, NJ 08544 USA
关键词
Communities; clustering algorithms; detection algorithms; statistical learning; network theory (graphs); BLOCKMODELS; ALGORITHMS;
D O I
10.1109/TIT.2015.2490670
中图分类号
TP [自动化技术、计算机技术];
学科分类号
080201 [机械制造及其自动化];
摘要
The stochastic block model with two communities, or equivalently the planted bisection model, is a popular model of random graph exhibiting a cluster behavior. In the symmetric case, the graph has two equally sized clusters and vertices connect with probability p within clusters and q across clusters. In the past two decades, a large body of literature in statistics and computer science has focused on providing lower bounds on the scaling of vertical bar p - q vertical bar to ensure exact recovery. In this paper, we identify a sharp threshold phenomenon for exact recovery: if alpha = pn/log(n) and beta = qn/log(n) are constant (with alpha > beta), recovering the communities with high probability is possible if (alpha + beta/2)- root alpha beta> 1 and is impossible if (alpha + beta/2)- root alpha beta < 1. In particular, this improves the existing bounds. This also sets a new line of sight for efficient clustering algorithms. While maximum likelihood (ML) achieves the optimal threshold (by definition), it is in the worst case NP-hard. This paper proposes an efficient algorithm based on a semidefinite programming relaxation of ML, which is proved to succeed in recovering the communities close to the threshold, while numerical experiments suggest that it may achieve the threshold. An efficient algorithm that succeeds all the way down to the threshold is also obtained using a partial recovery algorithm combined with a local improvement procedure.
引用
收藏
页码:471 / 487
页数:17
相关论文
共 40 条
[1]
Abbe E., 2014, DECODING BINARY NODE
[2]
Abbe E., 2013, CONDITIONAL RANDOM F
[3]
Abbe E, 2014, IEEE INT SYMP INFO, P1251, DOI 10.1109/ISIT.2014.6875033
[4]
Airoldi EM, 2008, J MACH LEARN RES, V9, P1981
[5]
[Anonymous], 2012, COMMUNITY DETECTION
[6]
A nonparametric view of network models and Newman-Girvan and other modularities [J].
Bickel, Peter J. ;
Chen, Aiyou .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2009, 106 (50) :21068-21073
[7]
Boppana R. B., 1987, 28th Annual Symposium on Foundations of Computer Science (Cat. No.87CH2471-1), P280, DOI 10.1109/SFCS.1987.22
[8]
GRAPH BISECTION ALGORITHMS WITH GOOD AVERAGE CASE BEHAVIOR [J].
BUI, TN ;
CHAUDHURI, S ;
LEIGHTON, FT ;
SIPSER, M .
COMBINATORICA, 1987, 7 (02) :171-191
[9]
Carson T, 2001, SIAM PROC S, P903
[10]
Chen Y., 2012, IMPROVED GRAPH CLUST