Spectral Clustering and Community Detection in Labeled Graphs
Brandon Fain, Stavros Sintos, Nisarg Raval
Machine Learning (CompSci 571D / STA 561D)
December 7, 2015
{btfain, nisarg, ssintos} at cs.duke.edu

Abstract

We study spectral clustering techniques for learning community structure in labeled random graphs, where edge labels from a label set L = {1, ..., L} are drawn according to discrete probability distributions parametrized by the community memberships of the two endpoints of the edge. This is a strict generalization of the standard stochastic block model for community detection.

1 Introduction and Related Work

Graph partitioning is a fundamental problem in computer science, first posed as the minimum cut problem: given a graph G = (V, E), find a partition of the vertices into two clusters S1 and S2, with S1 ∪ S2 = V and S1 ∩ S2 = ∅, that minimizes the number of edges (or the total weight of edges) crossing between the clusters. This generalizes naturally to clustering on a graph: given a graph, can we find a partition S1, ..., Sk, with S1 ∪ ... ∪ Sk = V and the Si pairwise disjoint, such that the graph is dense within clusters and sparse between clusters?

A widely successful genre of algorithms for graph partitioning problems is spectral clustering. As evidence, a technical report giving a tutorial on spectral clustering methods has well over 3,000 citations [1]. Broadly speaking, spectral clustering algorithms use the global information carried by the eigenvectors of the adjacency matrix to embed the graph into a lower dimensional space, i.e., a low rank approximation of the adjacency matrix. Other successful spectral algorithms operate on an eigendecomposition of the graph Laplacian. One application of spectral clustering of graphs in machine learning is the community detection problem.
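As a concrete illustration of the low rank embedding idea above, the following sketch (NumPy, with an illustrative toy graph of two triangles joined by an edge; the graph and function name are ours, not from the paper) computes the best rank-k approximation of an adjacency matrix via the SVD:

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A (Eckart-Young, via the SVD)."""
    U, s, Vt = np.linalg.svd(A)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# The columns of the rank-2 approximation act as low dimensional
# embeddings of the vertices; vertices in the same dense cluster
# tend to have similar columns.
A2 = rank_k_approximation(A, 2)
```

The higher the allowed rank, the smaller the approximation error, with the full-rank approximation recovering A exactly.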
Here, the assumption is that a graph (for example, the Facebook friend graph) represents relationships between persons, and clusters correspond to communities of friends or related people. A probabilistic model is assumed to generate the graph; the stochastic block model is the most widely studied such process. In the simplest case of two communities, the stochastic block model posits that every edge exists with probability p if its two endpoints belong to the same community and probability q otherwise, with p > q. The machine learning problem is to recover the community structure given a graph generated by this stochastic process. In an influential result [2], McSherry used spectral clustering techniques to show that community detection can be accomplished efficiently within this simple stochastic block model.

Very recently, work has been done to generalize the stochastic block model. Specifically, in [3], the authors work with the generalized stochastic block model; the substantial deviation is that edges are now allowed to have labels (which we will think of as weights), assumed to be drawn from an unknown distribution generated by a symmetric kernel on latent variables of the endpoints. The question of whether the distributions from which the labels are drawn can be recovered from a partially observed graph is answered in the affirmative, provided the graph is sufficiently
(logarithmic in |V|) dense. However, given the generality of the model, this method is not able to recover the latent variables on the vertices themselves.

In this work, we address a model of practical interest between the basic stochastic block model and the fully generalized model, called the labeled stochastic block model: edges still have labels drawn probabilistically, but we assume a standard community structure and attempt to recover this latent state from a graph. This has very recently been attempted in [4]. However, while the authors in [4] focus on semidefinite programming relaxations, we primarily consider spectral techniques based on reducing the labeled instance to an appropriate instance for McSherry's algorithm. The justification is that McSherry's algorithm and similar simple spectral techniques are easy to implement and robust in practice.

We see two immediate applications of understanding community detection in the labeled setting. The first is inferring social network community structure (for analytics, friend recommendation, etc.) when an interaction graph is available instead of just a friend graph, i.e., when the number of interactions in some time period between all pairs of users is available. The second is discovering network hotspots for diagnosis in a distributed system: given bandwidth information on all connections between servers, what are the communities of servers that are passing excessive information between one another?

2 Methods

Formally, we adopt the following labeled stochastic block model for the generation of random labeled graphs G = (V, E) with n nodes. Every vertex v in G belongs to one of k communities; let Ψ : {1, ..., n} → {1, ..., k} denote this community labeling on vertices.
Let p and q denote the intra-community and inter-community probabilities of drawing an edge, respectively; similarly, let P and Q be multinomials on the label set L = {1, ..., L} giving the probabilities of drawing particular edge labels for intra-community and inter-community edges, respectively. A random graph G is drawn from this stochastic process by the following steps.

1. For every vertex, draw a community from 1 to k according to an arbitrary distribution.

2. For every pair of vertices i, j, the edge (i, j) exists with probability p if i and j are in the same community and probability q otherwise.

3. For every edge (i, j) in the graph, draw a label from L = {1, ..., L} according to the multinomial distribution P = {p1, ..., pL} if i and j are in the same community and according to the multinomial distribution Q = {q1, ..., qL} otherwise.

We are given a single draw of a graph Ĝ from this random model, and we want to recover the complete labeling Ψ (up to isomorphism) with high probability.

We begin by giving McSherry's algorithm for the simpler case where we do not draw edge labels [2]. We assume that the value of k (the number of communities) is known; otherwise it is necessary to search over likely values of k and pick the best clustering. Essentially, we compute the best low rank (specifically, rank k) approximation of the adjacency matrix of Ĝ and cluster vertices according to that low rank approximation.

1: procedure Cluster(Ĝ, τ, k)
2:     Randomly divide the vertices into two sets
3:     Let Â and B̂ be the columns of Ĝ for the vertices in the first and second set, respectively
4:     Let P1 = Project(B̂, k, τ); let P2 = Project(Â, k, τ)
5:     Let Ĥ be the matrix whose columns are those of P1Â and P2B̂
6:     while some u ∈ Ĝ is unpartitioned do
7:         Arbitrarily choose an unpartitioned vertex ui
8:         for each v ∈ Ĝ do
9:             if ||Ĥ_ui − Ĥ_v||_2 ≤ τ then
10:                Ψ̂(v) = i
11:    Return Ψ̂
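A single draw from the labeled model (steps 1-3 above) can be sketched as follows; the sizes, probabilities, and label distributions are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, L = 100, 2, 3
p, q = 0.6, 0.3                       # intra/inter edge probabilities
P = np.array([0.1, 0.2, 0.7])         # intra-community label multinomial
Q = np.array([0.6, 0.3, 0.1])         # inter-community label multinomial

psi = rng.integers(0, k, size=n)      # step 1: community of each vertex
labels = np.zeros((n, n), dtype=int)  # entry 0 means "no edge"
for i in range(n):
    for j in range(i + 1, n):
        same = psi[i] == psi[j]
        if rng.random() < (p if same else q):            # step 2: edge?
            lab = 1 + rng.choice(L, p=P if same else Q)  # step 3: label in 1..L
            labels[i, j] = labels[j, i] = lab
```

Recovering psi from the labels matrix alone is exactly the learning problem studied here.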
In the Cluster algorithm, τ is a small similarity parameter. Ĥ_v refers to the column of Ĥ corresponding to vertex v in the transformed adjacency matrix. Project(B̂, k, τ) returns a projection operator (in the form of a matrix) which approximates the projection onto the column span of B̂. We omit the simple details of this subprocedure here.

Theorem 1. For an instance (Ψ, p, q, k) of the unlabeled problem, there is a constant c such that for sufficiently large n, if

    p − q > c · sqrt(p · log(n/δ) / n),

then Cluster can recover Ψ with probability 1 − δ [2].

We want to generalize the Cluster algorithm to solve the community detection problem (i.e., to recover Ψ) for labeled graphs. To do this, we introduce a threshold parameter t and reduce a labeled instance to an unlabeled instance by deleting all edges with labels below t and removing the labels from all edges with labels at or above t. Some values of t may fail because they produce reduced graphs for which the equivalent values of p and q are not well separated as required by Theorem 1. Thus, we introduce a heuristic for selecting a good value of t: we search for a value of t whose clustering via McSherry's algorithm yields well separated expected values of p and q. We justify this heuristic experimentally.

1: procedure LabeledCluster(Ĝ, τ, k)
2:     for t from 1 to L do
3:         Compute Ĝ_t: delete the edges of Ĝ with labels below t and remove all labels
4:         Ψ_t = Cluster(Ĝ_t, τ, k)
5:         p_t, q_t = E[p, q | Ĝ_t, Ψ_t]
6:     t* = argmax_t (p_t − q_t)
7:     Return Ψ_{t*}

For example, Figure 1 shows a graph whose labels are represented by colors; each instance from left to right shows the remaining graph passed to the unlabeled Cluster algorithm for increasing values of the threshold parameter t.
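The reduction and threshold-selection loop can be sketched compactly. Here `cluster` is only a stand-in for McSherry's Cluster (a top-k eigenvector embedding followed by a tiny 2-means), and the function names are ours; this is an illustrative sketch, not the paper's exact procedure:

```python
import numpy as np

def cluster(A, k=2):
    """Stand-in spectral clustering: 2-means on the top-k eigenvector embedding."""
    _, vecs = np.linalg.eigh(A)              # eigenvalues in ascending order
    X = vecs[:, -k:]                         # top-k eigenvector embedding
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)
    centers = X[[i, j]]                      # init: two farthest-apart points
    for _ in range(50):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        for c in range(2):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(0)
    return assign

def labeled_cluster(labels, L):
    """Try every threshold t; keep the clustering with the best p_t - q_t."""
    best_sep, best_psi = -np.inf, None
    for t in range(1, L + 1):
        G_t = (labels >= t).astype(float)    # drop edges below t, drop labels
        psi_t = cluster(G_t)
        same = psi_t[:, None] == psi_t[None, :]
        off = ~same                          # inter-community pairs
        np.fill_diagonal(same, False)        # exclude self-pairs
        p_t = G_t[same].mean() if same.any() else 0.0
        q_t = G_t[off].mean() if off.any() else 0.0
        if p_t - q_t > best_sep:
            best_sep, best_psi = p_t - q_t, psi_t
    return best_psi
```

On a toy instance where every intra-community pair is labeled 2 and every inter-community pair is labeled 1, thresholding at t = 1 gives a complete graph with no separation, while t = 2 isolates the communities, so the loop selects t = 2.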
Green and red vertices correspond to different communities; the intuition behind the heuristic is that the easiest instance of this graph for the Cluster algorithm should be the second or third, which should also be the instance where the expected values of p and q are best separated.

Figure 1: LabeledCluster example

2.1 Improving the Running Time

Naively, the algorithm depends linearly on L. Under certain assumptions on the distributions P and Q, we can improve this to a log L dependence for instances where the label set is large. This improvement comes from a noisy binary search over the label values used as thresholds, which is possible when the likely p − q values are concave with respect to the threshold. For instance, this occurs when P and Q are well separated symmetric distributions, such as discrete approximations of Gaussians (Figure 2), or monotone distributions. The binary search is noisy because for any given threshold there is some probability that the clustering algorithm fails, in which case the computed p and q values are nonsense. Consequently, the binary search should consider the median separation of p and q over small neighborhoods of threshold values (see Figure 3).

Figure 2: Two discrete approximations of Gaussians

The size of the neighborhoods considered can be scaled: for fast execution, set the neighborhoods to size 1; for careful execution, when the clustering algorithm may fail with non-negligible probability, the neighborhoods should be larger.

Figure 3: Binary search over groups of points

3 Results and Discussion

We implemented the proposed LabeledCluster algorithm and compared the results with a naive implementation of McSherry's algorithm [2] on synthetic graphs. We generated random graphs of different sizes (numbers of nodes) with varying intra-cluster and inter-cluster edge probabilities. We generated these random graphs with 2 partitions, and the goal is to recover these partitions as closely as possible. We measured the error as the difference between the true partitions and the computed partitions. More specifically, for every pair of nodes, we compare the computed partition relation with the true one, i.e., whether the two nodes are in the same partition or not. Based on this, we measure the partition error as the proportion of pairs of nodes that are incorrectly partitioned with respect to the ground truth. Since we generated the random graphs with known partitions, we take these as the ground truth. All of the results are averaged over 10 runs.

Figure 4: Partition error on synthetic graphs. (a) Without edge labels. (b) With edge labels.
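The pairwise partition error just described can be computed directly; a minimal sketch (function name ours), which is invariant to renaming the partitions:

```python
from itertools import combinations

def partition_error(pred, truth):
    """Fraction of vertex pairs whose same/different-partition relation
    disagrees between the computed partition and the ground truth."""
    pairs = list(combinations(range(len(truth)), 2))
    wrong = sum((pred[i] == pred[j]) != (truth[i] == truth[j])
                for i, j in pairs)
    return wrong / len(pairs)

truth = [0, 0, 1, 1]
assert partition_error([1, 1, 0, 0], truth) == 0.0  # same grouping, renamed
```

Because only same/different relations are compared, swapping the two partition names incurs no error, matching the up-to-isomorphism recovery goal.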
We measured the partitioning error of McSherry's algorithm on random graphs for different values of the edge probabilities. Here, p denotes the probability of an edge between two nodes in the same partition, and q denotes the probability of an edge between two nodes in different partitions. We compute the error for different p and q as shown on the x axis. As expected, Figure 4a shows that the partitioning error is low when p and q are well separated and tends to grow as this separation diminishes. We varied the number of nodes in a graph from 10 to 1000 and found that, for a fixed number of partitions, the error decreases as the number of nodes increases. For instance, fixing p and q to 0.6 and 0.4 respectively, the error on a graph with 10 nodes is 0.47, while the error on a graph with 1000 nodes is far lower. McSherry's algorithm gives a guarantee of success but does not give any bound on the error when the algorithm fails. In our preliminary results, we found that when there is a good separation between p and q, the algorithm produces partitions with low error even in the cases of failure.

Figure 4b shows the comparison of LabeledCluster (dynamic partition) and McSherry's algorithm (fixed partition) on a random graph with 3 edge labels {l0, l1, l2}, where l0 represents no edge. The corresponding label probabilities are fixed as P = {0, 0.2, 0.8} and Q = {0.5, 0.5, 0}. As we can see from the figure, the dynamic partitioning scheme performs much better than the fixed partitioning scheme on this specific instance. As explained before, the dynamic partitioning scheme tries all possible thresholds and selects the one that maximizes the distance between the intra-cluster and inter-cluster probabilities, which we heuristically take as the best partition. Note that we show this only for the case of 2 partitions and with well separated edge probabilities.
However, we believe that the dynamic partitioning method should give better results than the fixed partitioning scheme even in the case of multiple partitions, as long as the edge probabilities are well separated.

4 Future Work

For future work, it would be interesting to push the implementation of community detection in labeled graphs to real data and/or data with large label sets, to validate the idea of community detection with only log L dependence. Furthermore, for real data, we are interested in whether exploiting label information is useful in the sense that it provides better clustering, for applications where label information is available, compared to techniques that do not consider label information. Another interesting direction is to use our algorithm to get better clusters than McSherry's algorithm even without prior knowledge of labels on the graph. Instead of labels, we can set weights on the edges by taking advantage of the structure of the graph. For example, the weight of an edge (u, v) can be the number of common friends of nodes u and v. If we then cluster the weights into k levels (the first level containing the highest weights, the last the lowest, and so on), the label of an edge can be defined as the level of its weight. If the number of nodes is sufficiently large, it is possible to obtain a better separator (in expectation) than with the simple McSherry algorithm. Finally, it is of theoretical interest to ask whether the heuristic of well separated expected p and q values can be justified mathematically as well as intuitively and experimentally.

5 Conclusion

The stochastic block model can be fully generalized to describe the generation of nearly arbitrary random matrices with edge labels.
However, discovering the latent variables on the vertices is not possible in the fully general model; therefore the classic community detection problem is impossible in the generalized model. A slight restriction of the generalized stochastic block model to a discrete labeled version allows us to model weighted graphs generated according to a stochastic process and still solve community detection. We do this by reducing to an unlabeled instance and applying the classic spectral clustering technique of McSherry [2] to recover the latent variables (communities). We have also implemented our algorithm on synthetic data for small label sets and presented some preliminary results.

References

[1] Ulrike von Luxburg. A tutorial on spectral clustering. Technical report, Max Planck Institute.
[2] F. McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science (FOCS '01).

[3] J. Xu, L. Massoulié, and M. Lelarge. Edge label inference in generalized stochastic block models: from spectral theory to impossibility results. In Conference on Learning Theory (COLT '14).

[4] M. Lelarge, L. Massoulié, and J. Xu. Reconstruction in the labeled stochastic block model. ArXiv e-prints, February.
More informationSpatial-Color Pixel Classification by Spectral Clustering for Color Image Segmentation
2008 ICTTA Damascus (Syria), April, 2008 Spatial-Color Pixel Classification by Spectral Clustering for Color Image Segmentation Pierre-Alexandre Hébert (LASL) & L. Macaire (LAGIS) Context Summary Segmentation
More informationDistributed minimum spanning tree problem
Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with
More informationSome Applications of Graph Bandwidth to Constraint Satisfaction Problems
Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept
More informationOnline Social Networks and Media
Online Social Networks and Media Absorbing Random Walks Link Prediction Why does the Power Method work? If a matrix R is real and symmetric, it has real eigenvalues and eigenvectors: λ, w, λ 2, w 2,, (λ
More informationAarti Singh. Machine Learning / Slides Courtesy: Eric Xing, M. Hein & U.V. Luxburg
Spectral Clustering Aarti Singh Machine Learning 10-701/15-781 Apr 7, 2010 Slides Courtesy: Eric Xing, M. Hein & U.V. Luxburg 1 Data Clustering Graph Clustering Goal: Given data points X1,, Xn and similarities
More informationA Taxonomy of Semi-Supervised Learning Algorithms
A Taxonomy of Semi-Supervised Learning Algorithms Olivier Chapelle Max Planck Institute for Biological Cybernetics December 2005 Outline 1 Introduction 2 Generative models 3 Low density separation 4 Graph
More informationBayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis
Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical
More informationGraph Theory Problem Ideas
Graph Theory Problem Ideas April 15, 017 Note: Please let me know if you have a problem that you would like me to add to the list! 1 Classification Given a degree sequence d 1,...,d n, let N d1,...,d n
More informationSummary of Raptor Codes
Summary of Raptor Codes Tracey Ho October 29, 2003 1 Introduction This summary gives an overview of Raptor Codes, the latest class of codes proposed for reliable multicast in the Digital Fountain model.
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationReddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011
Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions
More informationNote Set 4: Finite Mixture Models and the EM Algorithm
Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for
More informationHow Hard Is Inference for Structured Prediction?
How Hard Is Inference for Structured Prediction? Tim Roughgarden (Stanford University) joint work with Amir Globerson (Tel Aviv), David Sontag (NYU), and Cafer Yildirum (NYU) 1 Structured Prediction structured
More informationAnalyzing Graph Structure via Linear Measurements.
Analyzing Graph Structure via Linear Measurements. Reviewed by Nikhil Desai Introduction Graphs are an integral data structure for many parts of computation. They are highly effective at modeling many
More informationGeneralized trace ratio optimization and applications
Generalized trace ratio optimization and applications Mohammed Bellalij, Saïd Hanafi, Rita Macedo and Raca Todosijevic University of Valenciennes, France PGMO Days, 2-4 October 2013 ENSTA ParisTech PGMO
More informationSPECTRAL SPARSIFICATION IN SPECTRAL CLUSTERING
SPECTRAL SPARSIFICATION IN SPECTRAL CLUSTERING Alireza Chakeri, Hamidreza Farhidzadeh, Lawrence O. Hall Department of Computer Science and Engineering College of Engineering University of South Florida
More informationA GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY
A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY KARL L. STRATOS Abstract. The conventional method of describing a graph as a pair (V, E), where V and E repectively denote the sets of vertices and edges,
More informationLecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic
SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association
More informationFunction approximation using RBF network. 10 basis functions and 25 data points.
1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data
More information10-701/15-781, Fall 2006, Final
-7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly
More informationEfficient Semi-supervised Spectral Co-clustering with Constraints
2010 IEEE International Conference on Data Mining Efficient Semi-supervised Spectral Co-clustering with Constraints Xiaoxiao Shi, Wei Fan, Philip S. Yu Department of Computer Science, University of Illinois
More informationClustering Using Graph Connectivity
Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the
More informationLimitations of Matrix Completion via Trace Norm Minimization
Limitations of Matrix Completion via Trace Norm Minimization ABSTRACT Xiaoxiao Shi Computer Science Department University of Illinois at Chicago xiaoxiao@cs.uic.edu In recent years, compressive sensing
More informationCS281 Section 9: Graph Models and Practical MCMC
CS281 Section 9: Graph Models and Practical MCMC Scott Linderman November 11, 213 Now that we have a few MCMC inference algorithms in our toolbox, let s try them out on some random graph models. Graphs
More informationLearning Low-rank Transformations: Algorithms and Applications. Qiang Qiu Guillermo Sapiro
Learning Low-rank Transformations: Algorithms and Applications Qiang Qiu Guillermo Sapiro Motivation Outline Low-rank transform - algorithms and theories Applications Subspace clustering Classification
More informationProblem Definition. Clustering nonlinearly separable data:
Outlines Weighted Graph Cuts without Eigenvectors: A Multilevel Approach (PAMI 2007) User-Guided Large Attributed Graph Clustering with Multiple Sparse Annotations (PAKDD 2016) Problem Definition Clustering
More informationData Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data
More informationComparing the strength of query types in property testing: The case of testing k-colorability
Comparing the strength of query types in property testing: The case of testing k-colorability Ido Ben-Eliezer Tali Kaufman Michael Krivelevich Dana Ron Abstract We study the power of four query models
More informationBioinformatics - Lecture 07
Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles
More informationOne-mode Additive Clustering of Multiway Data
One-mode Additive Clustering of Multiway Data Dirk Depril and Iven Van Mechelen KULeuven Tiensestraat 103 3000 Leuven, Belgium (e-mail: dirk.depril@psy.kuleuven.ac.be iven.vanmechelen@psy.kuleuven.ac.be)
More informationEstimating Human Pose in Images. Navraj Singh December 11, 2009
Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks
More informationOn the Approximability of Modularity Clustering
On the Approximability of Modularity Clustering Newman s Community Finding Approach for Social Nets Bhaskar DasGupta Department of Computer Science University of Illinois at Chicago Chicago, IL 60607,
More informationOn Dimensionality Reduction of Massive Graphs for Indexing and Retrieval
On Dimensionality Reduction of Massive Graphs for Indexing and Retrieval Charu C. Aggarwal 1, Haixun Wang # IBM T. J. Watson Research Center Hawthorne, NY 153, USA 1 charu@us.ibm.com # Microsoft Research
More informationChapter 9. Classification and Clustering
Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization
More informationGRAPH THEORY: AN INTRODUCTION
GRAPH THEORY: AN INTRODUCTION BEGINNERS 3/4/2018 1. GRAPHS AND THEIR PROPERTIES A graph G consists of two sets: a set of vertices V, and a set of edges E. A vertex is simply a labeled point. An edge is
More informationLatent Class Modeling as a Probabilistic Extension of K-Means Clustering
Latent Class Modeling as a Probabilistic Extension of K-Means Clustering Latent Class Cluster Models According to Kaufman and Rousseeuw (1990), cluster analysis is "the classification of similar objects
More information