Exact Recovery in Stochastic Block Models


Ben Eysenbach
May 11

1 Introduction

Graphs are a natural way of capturing interactions between objects, such as phone calls between friends, binding between proteins, and co-authoring between researchers. While the space of graphs is exponentially large in the number of vertices, most graphs of interest are sparse and exhibit some underlying structure. A number of algorithms attempt to recover this underlying structure. In this survey, we study when recovery is possible and which algorithms achieve it. We begin by describing a generative model for unstructured, random graphs in Section 2. We then extend this vanilla model to structured graphs in Section 3. Section 4 discusses why recovering the structure is difficult. We discuss two algorithms for recovery in Sections 5 and 6. We prove that recovery is impossible in some cases in Section 7.

Figure 1: Variable conventions

  A_u      u-th column of matrix A
  B        (k × k) symmetric matrix storing the probabilities that partitions interact
  G        Matrix of probabilities that each pair of vertices has an edge
  Ĝ        Adjacency matrix
  G(p, n)  Erdős-Rényi graph on n vertices, including each edge with probability p
  H        Estimate of G computed from Ĝ
  i, j     Particular partitions
  k        Number of partitions
  n        Number of vertices
  p        Probability of including an edge
  s_m      Minimum number of vertices in any partition
  T_u      Preliminary cluster of vertices, used in spectral algorithm
  τ        Threshold parameter for clustering in spectral algorithm
  u, v     Particular vertices

2 Erdős-Rényi Models

The Erdős-Rényi model [3] is a generative model for graphs. Specifically, it defines a distribution over all graphs on a fixed vertex set. The generated graphs do not (in general) exhibit any special structure, but studying Erdős-Rényi graphs does introduce a language which extends to structured graph models.
Formally, we define an Erdős-Rényi generative model G(p, n) as a distribution over undirected graphs with n vertices, where each edge (u, v) is included independently with probability p. By linearity of expectation, the expected number of edges is

$$\mathbb{E}_{G(p,n)}[\text{num. edges}] = (\text{prob. a pair of vertices has an edge}) \cdot (\text{num. pairs of vertices}) = p \cdot \frac{n(n-1)}{2} = O(pn^2)$$

Most real-world graphs are sparse, meaning the number of edges scales linearly with the number of vertices. For example, the graph of LinkedIn connections has 7 million vertices but only 30 million edges [7]. This number of edges is several orders of magnitude smaller than the tens of trillions of edges we would expect if the number of edges grew as $O(n^2)$. Thus, we are particularly interested in graphs where $p = O(1/n)$.

3 Stochastic Block Model

The stochastic block model, sometimes referred to as the planted partition model, is a cousin of the Erdős-Rényi model which allows the (undirected) graph to exhibit some structure. In the stochastic block model, each vertex is assigned a partition, and the probability of including an edge between vertices u and v depends only on the partitions of u and v. We will use the k × k symmetric matrix B to store the probability of edges between two partitions.

The stochastic block model can be interpreted as the concatenation of many Erdős-Rényi graphs. The edges between two partitions exactly follow the Erdős-Rényi model, as do the edges within a single partition.

Consider the amount of information required to represent the generative model. While the Erdős-Rényi model only stores a single value p, the stochastic block model stores $\frac{k(k+1)}{2} = O(k^2)$ values in the symmetric matrix B and n values to store the partitions. We usually consider models where $k \ll n$, so the stochastic block model requires much less information than a completely general model, which would use $O(n^2)$ space to store an interaction probability for each individual pair of vertices.

Two interesting graph structures are special cases of the stochastic block model. In the planted clique model, a graph is first generated from an Erdős-Rényi model. Then k vertices are chosen randomly, and all edges between pairs of these k vertices are added. This imposes a k-clique on the graph. In the second model, the graph coloring model, a graph is also initially generated from an Erdős-Rényi model. Each node is assigned a color, and all edges between nodes of the same color are deleted.
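To make the two special cases concrete, the following minimal sketch writes each one as a block matrix B. The function names are hypothetical and chosen for illustration only. The sketch assumes the planted clique forms one partition and all remaining vertices form the other, and that each color class is one partition.

```python
def planted_clique_B(p):
    """Block matrix for the planted clique model (illustrative helper).

    Partition 0 holds the clique vertices, partition 1 holds the rest.
    Clique pairs are always connected; every other pair keeps the
    Erdős-Rényi probability p.
    """
    return [[1.0, p],
            [p, p]]


def graph_coloring_B(num_colors, p):
    """Block matrix for the graph coloring model (illustrative helper).

    Same-color pairs never share an edge; pairs with different colors
    keep the Erdős-Rényi probability p.
    """
    return [[0.0 if i == j else p for j in range(num_colors)]
            for i in range(num_colors)]


def num_parameters(k, n):
    """Values needed to describe the model: the diagonal and upper
    triangle of the symmetric k x k matrix B, plus one label per vertex."""
    return k * (k + 1) // 2 + n


clique_B = planted_clique_B(0.3)
coloring_B = graph_coloring_B(3, 0.5)
```

For k = 3 partitions and n = 100 vertices, `num_parameters` counts 6 + 100 values, far fewer than the 4950 pairwise probabilities a fully general model on 100 vertices would need.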
Given a sample from a stochastic block model, the goal is to recover the assignment of nodes to partitions and the partition interaction matrix B.

4 Why are these problems hard?

Figure 2: Graph and corresponding adjacency matrix

Recovering the underlying partitions and interaction matrix is deceptively hard. Graphs drawn from stochastic block models are often depicted as in Fig. 2. The matrix on the right is the adjacency matrix for the graph on the left. Seeing these representations, it is easy to directly read off the partitions. Once the partitions are known, you can empirically estimate an entry (i, j) of the partition interaction matrix B by counting the edges between nodes in partitions i and j and dividing by the number of such pairs of nodes.
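The counting estimate described above can be sketched as follows, assuming the partition labels are already known. The name `estimate_B` and the small example graph are illustrative assumptions, not part of the original survey.

```python
from itertools import combinations


def estimate_B(adj, labels, k):
    """Estimate B[i][j] as (edges between partitions i and j) divided by
    (number of vertex pairs between partitions i and j)."""
    edges = [[0] * k for _ in range(k)]
    pairs = [[0] * k for _ in range(k)]
    for u, v in combinations(range(len(labels)), 2):
        i, j = sorted((labels[u], labels[v]))
        pairs[i][j] += 1
        edges[i][j] += adj[u][v]
    B_hat = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            # The estimate stays at 0.0 when a block has no vertex pairs.
            if pairs[i][j] > 0:
                B_hat[i][j] = B_hat[j][i] = edges[i][j] / pairs[i][j]
    return B_hat


# Example: partition 0 = {0, 1, 2}, partition 1 = {3, 4}.
labels = [0, 0, 0, 1, 1]
edge_list = [(0, 1), (1, 2), (3, 4), (0, 3), (2, 4)]
adj = [[0] * 5 for _ in range(5)]
for u, v in edge_list:
    adj[u][v] = adj[v][u] = 1
B_hat = estimate_B(adj, labels, 2)
```

In this example, 2 of the 3 pairs inside partition 0 are edges, the single pair inside partition 1 is an edge, and 2 of the 6 cross pairs are edges.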


Figure 3: Scrambled graph and permuted adjacency matrix

In Fig. 3, we have taken Fig. 2, rearranged the vertices and permuted the rows/cols of the adjacency matrix. This does not change the graph, but the structure of this graph is no longer apparent. There are exponentially many ways of permuting the rows/cols of the adjacency matrix, and only a vanishingly small fraction will look like the one shown in Fig. 2.

The two special graph models described in Section 3, the planted clique model and the graph coloring model, correspond to famous hard problems. Finding the largest clique in a graph and computing its chromatic number are both NP-hard problems.¹

Another challenge with recovering the underlying structure is that the Erdős-Rényi model places non-zero probability on all graphs. If the Erdős-Rényi model generated a complete graph, we would be unable to determine which k vertices the planted clique model chose as part of the clique. Similarly, if the Erdős-Rényi model generated an empty graph, we would be unable to determine how the graph coloring model colored the vertices.

These two challenges indicate that we cannot hope to accurately recover the underlying graph structure in every case. However, many hard problems, including finding the largest clique and graph coloring, are often much easier in the average case than in the worst case. In the next two sections, we introduce two algorithms for recovering the underlying structure.

5 Spectral Algorithm

5.1 Intuition

The key insight into how spectral algorithms work is that the adjacency matrix, Ĝ, is a noisy measurement of another matrix G. Each entry in G contains the probability that two nodes interact. We can construct this matrix for the stochastic block model by randomly assigning vertices to partitions, and then filling the entries of G by looking up the corresponding entries in B. The stochastic block model samples a graph by including each edge (u, v) with probability given by G(u, v). Matrix G has low rank: it contains at most k distinct rows/cols, so its rank is at most k.
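The construction of G and the sampling of Ĝ can be sketched as below. The function names are hypothetical. The sketch also assumes that self-loops are never sampled, which the text does not specify. The extreme choice of B (probabilities 0 and 1) is only there to make the sampled graph deterministic.

```python
import random


def edge_probability_matrix(labels, B):
    """Fill G[u][v] by looking up B at the partitions of u and v."""
    n = len(labels)
    return [[B[labels[u]][labels[v]] for v in range(n)] for u in range(n)]


def sample_adjacency(G, rng):
    """Sample a symmetric 0/1 adjacency matrix, including each edge (u, v)
    independently with probability G[u][v]. Self-loops are skipped
    (an assumption of this sketch)."""
    n = len(G)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < G[u][v]:
                A[u][v] = A[v][u] = 1
    return A


labels = [0, 1, 0, 1, 1]
B = [[1.0, 0.0],
     [0.0, 1.0]]
G = edge_probability_matrix(labels, B)
A = sample_adjacency(G, random.Random(0))

# G has one distinct row per partition, which is why its rank is at most k.
distinct_rows = {tuple(row) for row in G}
```

With less extreme entries in B, the rows of the sampled matrix `A` would no longer fall into k identical groups, which is exactly the noise the spectral algorithm has to remove.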
If we had access to G, we could identify the k distinct types of rows/cols. The type of row/col u would indicate to which partition vertex u belongs. Deleting all but one row/col of each type will leave a k × k matrix, which equals the partition interaction matrix B. Unfortunately, we only have access to the adjacency matrix Ĝ, not G.

¹ Both appeared on Richard Karp's 1972 list of 21 NP-complete problems.

5.2 Algorithm

We now present a spectral algorithm for recovering the stochastic block model based on McSherry [6]. We will build up to the final algorithm over the course of two failed (but conceptually useful) approaches.

5.2.1 Approach 1

In the edge probability matrix G, vertices u and v belonging to the same partition will have identical columns $G_u$ and $G_v$. We expect that the corresponding columns of the adjacency matrix, $\hat{G}_u$ and $\hat{G}_v$, are also close. Our first approach greedily clusters vertices based on these columns, where τ is a threshold variable:

1. While some vertex u has not been assigned to a partition:
   (a) Create a new partition and assign to it u and every close vertex: $\{v \in V : \|\hat{G}_u - \hat{G}_v\|_2 < \tau\}$

Unfortunately, the distance between $\hat{G}_u$ and $\hat{G}_v$ may be large, even if u and v belong to the same cluster (so $G_u = G_v$).

5.2.2 Approach 2

Our next approach attempts to smooth Ĝ to form another matrix H, and then applies Approach 1 to H. We can do this smoothing by first partially clustering the vertices. The clustering step uses the variable $s_m$, the number of vertices in the smallest partition. Then we can create the matrix H by representing each vertex as a combination of these initial clusters:

1. Initialize each vertex as not assigned.
2. While at least $\frac{1}{2} s_m$ vertices have not been assigned to a cluster:
   (a) Choose a random unassigned vertex u and initialize a new cluster $T_u = \emptyset$.
   (b) For each vertex $v \in V$:
       i. Compute $\hat{G}_u - \hat{G}_v$, and project the difference onto Ĝ.
       ii. If the projection has length less than τ, add v to $T_u$ and mark v as assigned.
3. For each unassigned vertex u:
   (a) Assign u to the cluster $T_v$ for which $\|u - v\|_2$ is minimized (comparing the columns associated with u and v).
4. Each cluster $T_u$ can be represented as a length-n binary vector indicating cluster membership. Stack these indicator vectors into a matrix C.
5. Let H be the projection of Ĝ onto C.
6. Apply Approach 1 to H.

This approach almost works. The one caveat is that we use Ĝ twice, once to compute the indicator matrix C, and again when we project onto C. Using Ĝ multiple times makes the analysis tricky.

5.2.3 Approach 3

The solution given by McSherry [6] splits Ĝ into two matrices of size (roughly) $n \times \frac{1}{2}n$. The proof of correctness requires that we split Ĝ. An alternate proof technique might not require splitting Ĝ.

5.3 Analysis

McSherry [6] shows that Approach 3 recovers the stochastic block model if the partitions are distinct enough.
Formally, we require that for any pair of vertices u and v belonging to different partitions, the L2 distance between $G_u$ and $G_v$ is large. When this requirement holds, Approach 3 succeeds with probability $1 - \delta$, which can be boosted by repetition. At a high level, the proof requires three steps. First, McSherry [6] shows that G and Ĝ are not too far apart. Next, he proves that our smoothed version of Ĝ, H, is close to G. Finally, he shows that the algorithm succeeds when H is close to G.

The proof sketch above only shows that the algorithm recovers the partitions, not the partition interaction matrix B. B can be empirically estimated after the fact using a simple average. However, it is impossible to recover B exactly. This negative result can be shown using information theory. Each entry in B is a number of arbitrary precision, requiring potentially infinitely many bits to express exactly. The graph we are given as input stores only a finite number of bits, one for each pair of vertices. By the pigeonhole principle, there will be different matrices B that produce the same graph, and our algorithm cannot distinguish between them.

Subsequent work by Vu [9] showed how the smoothing step in Approach 2 can be replaced by a simple SVD.
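As a concrete reference point for Section 5.2, here is a minimal sketch of the greedy thresholding from Approach 1. It is not McSherry's full algorithm: there is no smoothing and no splitting of Ĝ. The function names are hypothetical, and the sketch assumes that a vertex already placed in a partition is never reassigned. It is run on the noise-free matrix G, where columns within a partition are identical and the approach works exactly.

```python
import math


def column(M, u):
    """Return column u of a matrix stored as a list of rows."""
    return [row[u] for row in M]


def distance(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_threshold_cluster(M, tau):
    """Approach 1: repeatedly pick an unassigned vertex u and open a new
    partition containing u and every unassigned vertex whose column is
    within distance tau of column u."""
    n = len(M)
    assignment = [None] * n
    next_partition = 0
    for u in range(n):
        if assignment[u] is not None:
            continue
        col_u = column(M, u)
        for v in range(n):
            if assignment[v] is None and distance(col_u, column(M, v)) < tau:
                assignment[v] = next_partition
        next_partition += 1
    return assignment


# Noise-free edge probability matrix G for two partitions.
labels = [0, 0, 1, 1, 0]
B = [[0.8, 0.1],
     [0.1, 0.6]]
G = [[B[a][b] for b in labels] for a in labels]
recovered = greedy_threshold_cluster(G, tau=0.5)
```

On a sampled 0/1 adjacency matrix the same call is much less reliable, because two columns from the same partition can still differ in many entries. That gap is what the smoothing in Approaches 2 and 3 is meant to close.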

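6 Semidefinite Programming Algorithm

We now consider an alternate approach based on semidefinite programming by Abbe, Bandeira, and Hall [1]. This approach applies only to stochastic block models with two partitions of equal size, where the edge probabilities within and between the two partitions are p and q respectively. We assume p > q.

6.1 Defining the Semidefinite Program

The general approach is to maximize the number of edges within each partition while minimizing the number of edges between the partitions:

$$\max_{\text{partition}} \; (\text{num. edges within partitions}) - (\text{num. edges between partitions})$$

We first define this objective algebraically. Define $x_u \in \{\pm 1\}$ as a variable indicating to which partition vertex u belongs. We can write our objective as

$$\max_x \sum_{(u,v) \in E} x_u x_v$$

We can write this in matrix form as

$$\max_x \; x^T D x = \max_x \Big( \sum_{(u,v) \in E} x_u x_v - \sum_{(u,v) \notin E} x_u x_v \Big)$$

where D is similar to an adjacency matrix for the graph:

$$D[u,v] = \begin{cases} 1, & \text{if } (u,v) \in E \\ -1, & \text{if } (u,v) \notin E \end{cases}$$

Note that the number of within-partition pairs and the number of between-partition pairs are fixed at $2 \cdot \frac{1}{2} \cdot \frac{n}{2}\left(\frac{n}{2} - 1\right) = \frac{1}{2} n \left(\frac{n}{2} - 1\right)$ and $\left(\frac{n}{2}\right)^2 = \frac{1}{4} n^2$ respectively. The new objective has the same optimal solution as the old objective. Counting each unordered pair of vertices once (which only rescales the objective and shifts it by a constant), we have

$$
\begin{aligned}
\max_x \; x^T D x
&= \max_x \sum_{u,v} D[u,v] \, x_u x_v \\
&= \max_x \sum_{(u,v) \in E} x_u x_v - \sum_{(u,v) \notin E} x_u x_v \\
&= \max \; (\text{num. edges within partitions}) - (\text{num. edges between partitions}) \\
&\qquad - (\text{num. non-edges within partitions}) + (\text{num. non-edges between partitions}) \\
&= \max \; (\text{num. edges within partitions}) - (\text{num. edges between partitions}) \\
&\qquad - \big((\text{num. pairs within partitions}) - (\text{num. edges within partitions})\big) \\
&\qquad + \big((\text{num. pairs between partitions}) - (\text{num. edges between partitions})\big) \\
&= \max \; 2(\text{num. edges within partitions}) - 2(\text{num. edges between partitions}) - \tfrac{1}{2} n \left(\tfrac{n}{2} - 1\right) + \tfrac{1}{4} n^2 \\
&= \max \; (\text{num. edges within partitions}) - (\text{num. edges between partitions})
\end{aligned}
$$

where the last line keeps the maximizer but not the value, since positive scaling and additive constants do not change which partition is optimal.

The ±1 constraints on x make this optimization problem hard. We instead formulate it as an SDP without integer constraints. A challenge is converting the objective $x^T D x$ into a Frobenius product of two matrices. Recalling that the trace is invariant under cyclic permutations, we have

$$x^T D x = \operatorname{tr}(x^T D x) = \operatorname{tr}(D (x x^T)) = \langle D, x x^T \rangle_F$$

The last equality comes from the fact that the trace of a product of two matrices equals their Frobenius product if at least one of the matrices is symmetric. Our SDP will solve for a matrix X that takes the place of $x x^T$.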
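The equivalence between the quadratic objective and the edge-count objective can be checked by brute force on a small graph. The sketch below is only a sanity check, not part of the algorithm in [1]. The helper names and the 6-vertex example (two triangles joined by one edge) are illustrative assumptions. Each unordered pair is counted once, as in the derivation above.

```python
from itertools import combinations


def quadratic_objective(n, edges, x):
    """Sum over unordered pairs u < v of D[u, v] * x[u] * x[v], where
    D[u, v] is +1 for an edge and -1 for a non-edge."""
    edge_set = {frozenset(e) for e in edges}
    total = 0
    for u, v in combinations(range(n), 2):
        d = 1 if frozenset((u, v)) in edge_set else -1
        total += d * x[u] * x[v]
    return total


def edge_balance(edges, x):
    """(num. edges within partitions) - (num. edges between partitions)."""
    return sum(x[u] * x[v] for u, v in edges)


def balanced_assignments(n):
    """All +/-1 vectors with exactly n/2 entries equal to +1."""
    for plus in combinations(range(n), n // 2):
        yield [1 if u in plus else -1 for u in range(n)]


n = 6
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

within_pairs = (n // 2) * (n // 2 - 1)   # equals (1/2) n (n/2 - 1)
between_pairs = (n // 2) ** 2            # equals (1/4) n^2
const = between_pairs - within_pairs

# The two objectives differ by a factor of 2 and an additive constant.
identity_holds = all(
    quadratic_objective(n, edges, x) == 2 * edge_balance(edges, x) + const
    for x in balanced_assignments(n)
)
best = max(balanced_assignments(n),
           key=lambda x: quadratic_objective(n, edges, x))
```

Here `best` separates the two triangles, the same partition that maximizes the number of within-partition edges minus the number of between-partition edges.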

The relaxed SDP is

$$\max_X \; \langle D, X \rangle_F \quad \text{s.t.} \quad X_{ii} = 1, \quad X \succeq 0$$

If p is sufficiently larger than q, then with high probability the relaxed SDP above will have a rank-1 solution $g g^T$ with $g \in \{\pm 1\}^n$. The coordinates of g will indicate to which partition each vertex belongs.

6.2 Analysis

We want to show that $g g^T$ will be the unique solution to the SDP. In this subsection we follow the notation of [1] and write B for the ±1 matrix that Section 6.1 called D; it should not be confused with the partition interaction matrix. To show that $g g^T$ is an optimal solution, Abbe et al. [1] show that the diagonal of $B g g^T$ equals Y, some feasible solution to the dual of the SDP:

$$\min_Y \; \operatorname{tr}(Y) \quad \text{s.t.} \quad Y \succeq B, \quad Y \text{ diagonal}$$

Next, define two diagonal matrices storing the within-partition and between-partition degrees of each vertex:

$$D^+_{uu} = \text{num. edges from } u \text{ to another vertex in the same partition}$$
$$D^-_{uu} = \text{num. edges from } u \text{ to another vertex in a different partition}$$

We can compute Y directly by rewriting $B g g^T$ in terms of $D^+$ and $D^-$. Counting only pairs that involve vertex u,

$$
\begin{aligned}
(B g g^T)_{uu} &= (\text{num. edges within partitions}) + (\text{num. non-edges between partitions}) \\
&\qquad - (\text{num. non-edges within partitions}) - (\text{num. edges between partitions}) \\
&= D^+_{uu} + \left(\tfrac{n}{2} - D^-_{uu}\right) - \left(\tfrac{n}{2} - 1 - D^+_{uu}\right) - D^-_{uu} \\
&= 2\left(D^+_{uu} - D^-_{uu}\right) + 1
\end{aligned}
$$

Setting $Y = 2(D^+ - D^-) + I_n$ gives a feasible solution to the dual program. To show that $g g^T$ is the unique solution, Abbe et al. [1] show that the second eigenvalue of Y is not too small.

7 Lower Bounds for Recovery

We now study when these two algorithms will succeed with high probability and compare those bounds to information theoretic lower bounds on recovery. First, it is convenient to define the rescaled parameters $a = pn / \log n$ and $b = qn / \log n$, following the scaling $p = a \log n / n$ and $q = b \log n / n$ used in [1]. The SDP algorithm succeeds in the regime where $(a - b)^2 > 8(a + b) + \frac{8}{3}(a - b)$. This is substantially better than the spectral algorithm presented in Section 5, which succeeds only when $(a - b)^2 > 64(a + b)$.

It is important to note that all algorithms fail when a and b are too close. Specifically, no algorithm can recover the partitions when $(a - b)^2 < 4(a + b) - 4$ and $a + b > 2$ [1]. Note that neither of the algorithms presented achieves this lower bound. This lower bound can be shown via an information theoretic argument.
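The three regimes can be compared numerically with a few one-line predicates. This is a minimal sketch: the function names are hypothetical, and the inequalities are the ones quoted in this section, with the 8/3 coefficient of the SDP condition taken from the statement in [1] as best it can be reconstructed here.

```python
def sdp_guarantee(a, b):
    """Sufficient condition quoted for the SDP algorithm of Section 6."""
    return (a - b) ** 2 > 8 * (a + b) + (8 / 3) * (a - b)


def spectral_guarantee(a, b):
    """Sufficient condition quoted for the spectral algorithm of Section 5."""
    return (a - b) ** 2 > 64 * (a + b)


def recovery_impossible(a, b):
    """Regime in which no algorithm can recover the partitions exactly."""
    return (a - b) ** 2 < 4 * (a + b) - 4 and a + b > 2


# a = 30, b = 2: covered by the SDP guarantee but not the spectral one.
sdp_only = sdp_guarantee(30, 2) and not spectral_guarantee(30, 2)

# a = 3, b = 1: inside the impossibility regime.
too_close = recovery_impossible(3, 1)
```

Pairs such as a = 12, b = 2 satisfy none of the three conditions, which illustrates the gap between the lower bound and the guarantees of the two presented algorithms.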
For intuition about the lower bound, consider the problem of distinguishing the two-partition stochastic block model of Section 6 from an Erdős-Rényi graph with edge probability $\frac{1}{2}(p + q)$. If $p - q$ is small, the graph will not contain enough bits of information to distinguish these two models.

8 Conclusion

In this survey, we introduced two random graph models, Erdős-Rényi models and stochastic block models. We discussed why uncovering the structure of graphs generated by the second model is difficult. We then presented two algorithms for recovering this structure and sketched correctness proofs. We finished by showing that recovery is not always possible and by discussing lower bounds.

A number of open problems remain in this area. First, how many real-world graphs have distinct enough partitions to be recovered using the presented algorithms? How many fall into the range between the lower bound of Abbe et al. [1] and the thresholds guaranteed for the presented algorithms? Second, we can extend the stochastic block model to allow vertices to belong to multiple partitions, creating a mixed membership stochastic block model [2]. When is exact recovery possible in this setting, and what are the best algorithms for recovery? Third, what if we only want to partially recover the underlying partitions? Is there a unified approach to analyzing both the exact recovery and partial recovery settings?

References

[1] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. arXiv preprint.
[2] Edo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems, pages 33-40.
[3] Paul Erdős and Alfréd Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6.
[4] Anna Goldenberg, Alice X Zheng, Stephen E Fienberg, and Edoardo M Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2).
[5] Bruce Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community detection on random graphs. arXiv preprint.
[6] Frank McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science. IEEE.
[7] Elchanan Mossel, Joe Neeman, and Allan Sly. Stochastic block models and reconstruction. arXiv preprint.
[8] Tiago P Peixoto. Hierarchical block structures and high-resolution model selection in large networks. Physical Review X, 4(1):011047.
[9] Van Vu. A simple SVD algorithm for finding hidden partitions. arXiv preprint.
