Spectral Clustering: A Graph Partitioning Point of View

Yangzihao Wang
Computer Science Department, University of California, Davis
yzhwang@ucdavis.edu

Abstract

This course project presents the basic theory of spectral clustering from a graph partitioning point of view. It also derives two typical spectral clustering algorithms: ratio cuts and normalized cuts. We run experiments on several large graphs and discuss and analyze the results. Finally, we summarize the algorithms we have used and discuss the possibility, and the possible issues, of using parallel computing to improve their performance.

I. Introduction

Clustering in general is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). Spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The term refers to a set of heuristic algorithms, all based on the overall idea of computing the first few singular vectors and then clustering in a low-dimensional (in certain cases simply one-dimensional) subspace.

I.1 Problem Statement

In this course project I focus on using spectral clustering as an approximate solution to the k-way graph partitioning problem, and I try to solve the graph partitioning problem for large-scale undirected social network graphs using two common spectral clustering techniques: ratio cuts and normalized cuts.

II. Basic Theory of Spectral Clustering

In this section I discuss the mathematical objects used in spectral clustering and the link between spectral clustering and graph partitioning. I then show how spectral clustering can be derived as an approximation to such graph partitioning problems.

II.1 Graph Laplacians

Suppose we have an undirected weighted graph G = (V, E). In the spectral clustering setting, the vertices of G form the set that needs to be clustered into k clusters.
* Second-year Ph.D. student working with Prof. John Owens.

Various ways can be used to compute the edge weight between each pair of vertices. These weights form the weight matrix W, where w_{ij} = w_{ji} \ge 0. The degree of a vertex v_i \in V is defined as

    d_i = \sum_{j=1}^{n} w_{ij}.

We can thus define the degree matrix D as the diagonal matrix with the degrees d_1, d_2, ..., d_n on the diagonal. The unnormalized graph Laplacian matrix is defined as

    L = D - W,

and there are two matrices which are called normalized graph Laplacians:

    L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2},
    L_{rw} = D^{-1} L = I - D^{-1} W.

Von Luxburg's tutorial [3] covers several properties of Laplacian matrices. With these definitions in hand, we can view the spectral clustering problem from the perspective of graph partitioning.

II.2 Graph Partitioning Point of View

Given a similarity graph with adjacency matrix W, the simplest and most direct way to construct a partition is to solve the mincut problem. This consists of choosing the partition A_1, A_2, ..., A_k which minimizes

    cut(A_1, ..., A_k) = \sum_{i=1}^{k} cut(A_i, \bar{A}_i).

Since mincut often produces imbalanced partitions (e.g. a partition that contains only one vertex), the objective function needs to be improved to guarantee that the sets A_1, ..., A_k are "reasonably large". The two most common objective functions which encode this are RatioCut and the normalized cut Ncut:

    RatioCut(A_1, ..., A_k) = \sum_{i=1}^{k} cut(A_i, \bar{A}_i) / |A_i|,
    Ncut(A_1, ..., A_k) = \sum_{i=1}^{k} cut(A_i, \bar{A}_i) / vol(A_i).

In RatioCut, the size of a subset A of the graph is measured by its number of vertices |A|, while in Ncut the size is measured by the total weight of its edges, vol(A). Now we look at RatioCut and Ncut separately. First, the relaxation of the RatioCut minimization problem for a general value of k looks like this. Given a partition of V into k sets A_1, ..., A_k, define k indicator vectors h_j = (h_{1,j}, ..., h_{n,j}) by

    h_{i,j} = 1 / \sqrt{|A_j|} if v_i \in A_j, and h_{i,j} = 0 otherwise.

Then we set H \in R^{n \times k} to be the matrix containing those k indicator vectors as columns.
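A small numerical sketch ties these definitions together. The graph below is a hypothetical 4-vertex example (not from the project data): it builds W, D, the three Laplacians, and the indicator matrix H for the 2-way partition A_1 = {0, 1}, A_2 = {2, 3}, then checks that the columns of H are orthonormal and that Tr(H^T L H) equals the RatioCut value, the identity derived next in the text.

```python
import numpy as np

# Hypothetical symmetric weight matrix W for a 4-vertex graph.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])

d = W.sum(axis=1)                                # degrees d_i = sum_j w_ij
D = np.diag(d)                                   # degree matrix
L = D - W                                        # unnormalized Laplacian
L_sym = np.eye(4) - np.diag(d**-0.5) @ W @ np.diag(d**-0.5)
L_rw = np.eye(4) - np.diag(1.0 / d) @ W

# Indicator matrix H for the partition A_1 = {0, 1}, A_2 = {2, 3}:
# H[i, j] = 1/sqrt(|A_j|) if vertex i is in A_j, else 0.
parts = [[0, 1], [2, 3]]
H = np.zeros((4, 2))
for j, A in enumerate(parts):
    H[A, j] = 1.0 / np.sqrt(len(A))

assert np.allclose(L.sum(axis=1), 0)             # rows of L sum to zero
assert np.allclose(H.T @ H, np.eye(2))           # columns of H are orthonormal
# cut(A_1, A_2) = w_02 + w_12 = 2, so RatioCut = 2/2 + 2/2 = 2 = Tr(H^T L H).
assert np.isclose(np.trace(H.T @ L @ H), 2.0)
```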
Observe that the columns of H are orthonormal to each other, that is, H^T H = I. For the j-th column h_j of H we have

    h_j^T L h_j = (1/2) \sum_{u,v=1}^{n} w_{uv} (h_{u,j} - h_{v,j})^2.
Thus we have

    h_j^T L h_j = cut(A_j, \bar{A}_j) / |A_j| = (H^T L H)_{jj},

and therefore

    RatioCut(A_1, ..., A_k) = \sum_{j=1}^{k} (H^T L H)_{jj} = Tr(H^T L H).

Following Dhillon's paper [1], the ratio cut problem can thus be expressed as a trace minimization problem:

    min_{A_1, ..., A_k} Tr(H^T L H) subject to H^T H = I, H_{ij} = h_{i,j}.

If we relax the problem by allowing the entries of the matrix H to take arbitrary real values, the relaxed problem becomes

    min_{H \in R^{n \times k}} Tr(H^T L H), H^T H = I.

Von Luxburg's tutorial [3] shows that the same relaxation also works for normalized cuts. For normalized cuts we use a different Laplacian matrix, and the problem looks like this:

    min_{U \in R^{n \times k}} Tr(U^T D^{-1/2} L D^{-1/2} U), U^T U = I,

where U = D^{1/2} H. By the Rayleigh-Ritz theorem, a well-known solution to such a problem is obtained by computing the eigenvectors belonging to the k smallest eigenvalues of the corresponding Laplacian matrix. These eigenvectors are then used to compute a discrete partitioning of the points.

III. Algorithm

In this section, we specify the algorithms we use for this course project. The normalized spectral clustering algorithm is taken from the papers of Ng et al. [2] and Dhillon et al. [1]; the unnormalized spectral clustering algorithm is taken from Von Luxburg's tutorial [3].

Algorithm 1 Unnormalized spectral clustering algorithm
 1: procedure RatioCut(W, k)    ▷ take a weight matrix and cut the graph into k parts
 2:   Construct the diagonal degree matrix D
 3:   L ← D - W    ▷ compute the unnormalized Laplacian L
 4:   Compute the eigenvectors v_1, ..., v_k of L belonging to the k smallest eigenvalues
 5:   Form the matrix V ∈ R^{n×k} containing the vectors v_1, ..., v_k as columns
 6:   for i ← 1 to n do
 7:     y_i ← V(i, :)    ▷ let y_i ∈ R^k be the vector corresponding to the i-th row of V
 8:   end for
 9:   Cluster the points (y_i)_{i=1,...,n} in R^k with the k-means algorithm into clusters C_1, ..., C_k
10:   return clusters A_1, ..., A_k with A_i = {j | y_j ∈ C_i}
11: end procedure

The normalized spectral clustering algorithm uses a different Laplacian matrix and normalizes the rows of the eigenvector matrix before using k-means to cluster them into k partitions. Both algorithms use the same framework with two different graph Laplacians.
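To make the relaxation concrete, here is a minimal Python sketch (an illustration, not the project's MATLAB code) of the unnormalized algorithm for k = 2 on a hypothetical graph of two triangles joined by a single bridge edge. For k = 2 the sign pattern of the second eigenvector (the Fiedler vector) already yields the partition that k-means would find on the rows of V, so the sketch thresholds it directly.

```python
import numpy as np

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W                           # unnormalized Laplacian

# Eigenvectors for the k = 2 smallest eigenvalues; the columns of V
# play the role of the relaxed indicator matrix H.
vals, vecs = np.linalg.eigh(L)      # eigh returns ascending eigenvalues
V = vecs[:, :2]

# For k = 2, splitting on the sign of the Fiedler vector gives the same
# partition that k-means finds on the rows of V.
labels = (V[:, 1] > 0).astype(int)

assert labels[0] == labels[1] == labels[2]   # one triangle stays together
assert labels[3] == labels[4] == labels[5]   # the other triangle too
assert labels[0] != labels[3]                # the cut falls on the bridge edge
```

The recovered cut severs only the single bridge edge, which is exactly the balanced minimum cut for this graph.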
Algorithm 2 Normalized spectral clustering algorithm
 1: procedure NCut(W, k)    ▷ take a weight matrix and cut the graph into k parts
 2:   Construct the diagonal degree matrix D
 3:   L ← D^{-1/2} W D^{-1/2}    ▷ the top eigenvectors of this matrix are the bottom eigenvectors of the normalized Laplacian L_{sym} = I - D^{-1/2} W D^{-1/2}
 4:   Compute the top k eigenvectors v_1, ..., v_k of L
 5:   Form the matrix V ∈ R^{n×k} containing the vectors v_1, ..., v_k as columns
 6:   for i ← 1 to n do
 7:     y_{ij} ← V_{ij} / ||V(i, :)||    ▷ let y_i ∈ R^k be the normalized vector corresponding to the i-th row of V
 8:   end for
 9:   Cluster the points (y_i)_{i=1,...,n} in R^k with the k-means algorithm into clusters C_1, ..., C_k
10:   return clusters A_1, ..., A_k with A_i = {j | y_j ∈ C_i}
11: end procedure

The main trick is to change the representation of the eigenvectors v_i into points y_i ∈ R^k. This change of representation enhances the cluster properties in the data, so that the clusters can be trivially detected in the new representation.

IV. Implementation

The implementation of the algorithms is straightforward in MATLAB. Before running each algorithm, we first use a simple Python script to convert the graph data into Matrix Market format. In the unnormalized spectral clustering algorithm the Laplacian matrix is symmetric; in the normalized spectral clustering algorithm the normalized Laplacian is symmetric positive semi-definite. Thus we use the MATLAB function eigs() to compute the required k eigenvectors (those for the smallest eigenvalues of L in the unnormalized case, and for the largest eigenvalues of D^{-1/2} W D^{-1/2} in the normalized case). During the k-means phase we apply the MATLAB function kmeans(). One issue during the implementation is that sometimes, when using eigs() to compute eigenvectors, not all eigenvalues converge. Also, because the result of k-means largely depends on the initial condition, the solution of our algorithm is not deterministic; in the worst case, empty clusters are formed.

V. Experiment Results

V.1 Experiment Environment

The machine we use in this course project has a 1.70 GHz Intel(R) Core(TM) i7-2637M CPU with 8 GB RAM. The code runs under MATLAB R2012b.

V.2 Data Sources

For this course project, we use 4 graphs for our experiments, as Figure 1 shows.
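The eigs() convergence issue mentioned above has a standard workaround: shift-invert mode. The sketch below uses Python/SciPy rather than MATLAB (both wrap the same ARPACK library) on a hypothetical path graph. Asking for the eigenvalues of L closest to a small negative shift returns exactly the smallest ones, and it converges far more reliably than requesting them directly from a singular Laplacian.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Hypothetical input: unnormalized Laplacian of a path graph on 8 vertices.
n = 8
ones = np.ones(n - 1)
W = sp.diags([ones, ones], [-1, 1], format='csr')
D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
L = (D - W).tocsc()

# Shift-invert: eigenvalues nearest sigma converge fastest.  A small
# negative sigma keeps L - sigma*I nonsingular even though L itself is
# singular (its smallest eigenvalue is 0).
vals, vecs = eigsh(L, k=3, sigma=-0.01, which='LM')

# The path graph is connected, so exactly one eigenvalue is (numerically) 0.
assert np.min(np.abs(vals)) < 1e-8
assert np.sum(np.abs(vals) < 1e-8) == 1
```

MATLAB's eigs() exposes the same mechanism through its sigma argument, which is one way to sidestep the non-convergence we observed.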
The first one is a generated block diagonal graph which contains 4 connected components; it is used to test the correctness of our implementations of the two algorithms. The second and third graphs are undirected graphs: one is the Enron email graph, where each node denotes a person and an edge between two persons implies email communication between them; the other is one category of the Arxiv collaboration graph, where each node is an author in the Arxiv network and there is an edge between every pair of co-authors. The last one is a bipartite CharityNet graph which records how multiple donors donate to different charities. The first two graphs are unweighted, while the last graph is weighted by the amount of donation each donor makes. From the matrix view we
can see that both the test block diagonal graph and the CharityNet graph have several disconnected components, which makes clustering/partitioning much easier, while the other two graphs have either only a few connected components (Enron email) or multiple uniformly distributed connected components, which makes clustering/partitioning a relatively difficult task.

(a) Test Block Diagonal Graph (b) Enron Email Graph (c) Arxiv Collaboration Graph (d) CharityNet Graph

Figure 1: Matrix view of the graph topology of the four graphs we use in the course project.

V.3 Result and Discussion

We first show the result of partitioning the test block diagonal graph. The graph has 4 connected components, and using k = 4 to perform the ratio cut algorithm, we successfully get the optimal partitioning solution with a cut of 0. The following table shows the results of performing ratio cut on the Enron email graph and the Arxiv collaboration graph with k set to 6, 8 and 16. The table shows a great reduction in the total number of cut edges compared with randomly picking edges as cuts in the graph. However, we can see from the table below that the partition sizes are ill-balanced for the Arxiv collaboration and Enron email graphs. The reason behind this is the topology of the graphs. As we have stated, the Enron email graph contains only a few connected components, while the Arxiv collaboration graph contains multiple
uniformly distributed connected components.

Stdev of cut number    Partition number 6    Partition number 8    Partition number 16
Arxiv Collaboration    2116.6                1832.4                1294.3
Enron Email            14975                 12225                 9165.2

(a) Enron Email Graph (b) Arxiv Collaboration Graph

Figure 2: Ratio of cut edges to total edges for the Enron email and Arxiv collaboration graphs using the ratio cut algorithm.

The second experiment is on the weighted bipartite CharityNet graph. Owing to the nice topology of the graph itself (with 27 connected components), we get a better result in terms of the balance of the partitioning when applying the normalized cuts algorithm to it, as Figure 3 shows.

Figure 3: Sizes of the 16 partitions produced by the ncut algorithm on the CharityNet graph, compared with the average partition size.

The experiment results show that both ratio cut and ncut can give a good approximation of the min-cut graph partitioning problem in terms of reducing the cut size. However, even the ncut algorithm, applied to a weighted graph with multiple connected components, still cannot give a perfect result in terms of load balancing of the partition sizes. Also, I find that for large-scale graph data, using a clustering or community detection algorithm as a method of graph partitioning is unnecessary for basic primitives such as breadth-first search, shortest path and connected component labeling.

VI. Conclusion

In this course project I learned the idea of spectral clustering: construct graph partitions based on the eigenvectors of matrices derived from the graph, its Laplacians. I realized this beautiful theory has been extended
to data mining, machine learning and many other fields. The success of spectral clustering is based on its ability to solve very general problems without any assumptions on the form of the clusters. It is also easy to implement once we define the similarity graph. The main process is to solve a linear algebra problem; there are no issues such as convergence criteria or restarting the algorithm with different initializations. However, from the experiments I have done, I find that defining a good similarity graph and choosing the right Laplacian matrix are important and have a great influence on the stability of the algorithm. Also, clustering/graph partitioning, as an irregular problem, is an interesting topic in general-purpose GPU computing. Parallel computing is one promising way to overcome the computational challenge of large-scale clustering problems. With linear algebra tools such as cuSPARSE and cuBLAS, both building blocks of the spectral clustering algorithm (eigenvector computation and the k-means algorithm) can be implemented on the GPU. The issues that come with this are the following: 1) an efficient data layout for graphs which improves the performance of irregular memory accesses; 2) ways to decouple dependencies between graph nodes/edges when doing the computation.

References

[1] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pages 551-556, New York, NY, USA, 2004. ACM.

[2] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.

[3] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
Appendix

Listing 1: Normalized Laplacian Matrix Code

function [ L ] = laplacian( A )
%LAPLACIAN  Build the normalized affinity matrix D^(-1/2) * A * D^(-1/2)
%   used by the normalized spectral clustering algorithm.
row = sum(A, 2);          % vertex degrees
row = 1 ./ sqrt(row);     % d_i^(-1/2)
n = length(row);
D = spdiags(row(:), 0, n, n);
L = D * A * D;
end

Listing 2: Unnormalized Laplacian Matrix Code

function [ L ] = laplacian2( A )
%LAPLACIAN2  Build the unnormalized Laplacian L = D - A and return its
%   pseudo-inverse, so that the top eigenvectors of the result correspond
%   to the smallest nonzero eigenvalues of L.  (L itself is singular, so
%   a plain inverse would be ill-defined.)
row = sum(A, 2);
n = length(row);
D = spdiags(row(:), 0, n, n);
L = D - A;
L = pinv(full(L));
end

Listing 3: Spectral Clustering Code

function [ idx, C, sumd ] = cluster( L, k )
%CLUSTER  Row-normalize the top-k eigenvectors of L and run k-means.
[V, ~] = eigs(L, k);                 % top k eigenvectors (largest magnitude)
[n1, ~] = size(V);
for i = 1:n1
    V(i, :) = V(i, :) / norm(V(i, :));   % normalize each row to unit length
end
[idx, C, sumd] = kmeans(V, k);
fid = fopen('partition.txt', 'wt');
for i = 1:n1
    fprintf(fid, '%d\n', idx(i));
end
fclose(fid);
end
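The partition quality metrics reported in Section V (total cut size and the spread of partition sizes) can be recomputed from the cluster labels written to partition.txt. A minimal Python sketch with hypothetical helper names, illustrated on a toy graph of two bridged triangles:

```python
import numpy as np

def cut_size(W, labels):
    """Total weight of edges whose endpoints lie in different clusters."""
    diff = labels[:, None] != labels[None, :]
    return W[diff].sum() / 2.0            # each cut edge is counted twice

def partition_size_stdev(labels):
    """Standard deviation of the cluster sizes -- a balance measure."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.std()

# Hypothetical toy graph: two triangles joined by one bridge edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])

assert cut_size(W, labels) == 1.0           # only the bridge edge is cut
assert partition_size_stdev(labels) == 0.0  # perfectly balanced partition
```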