On Clusterings Good, Bad and Spectral


Ravi Kannan* (Computer Science, Yale University), Santosh Vempala† (Mathematics, M.I.T.), Adrian Vetta‡ (Mathematics, M.I.T.)

Abstract

We propose a new measure for assessing the quality of a clustering. A simple heuristic is shown to give worst-case guarantees under the new measure. Then we present two results regarding the quality of the clustering found by a popular spectral algorithm. One proffers worst-case guarantees whilst the other shows that if there exists a good clustering then the spectral algorithm will find one close to it.

1 Introduction

Clustering, or partitioning into similar groups, is a problem with many variants in mathematics as well as in the applied sciences. The availability of vast amounts of data has revitalized research on the problem. Over the years, several clever heuristics have been invented for clustering. While many of them are special-purpose, the heuristic known as spectral clustering stands out in that it has been used in a variety of situations, often successfully. This method is described in section 2. Roughly speaking, spectral clustering is the technique of partitioning the rows of a matrix according to their components in the top few singular vectors of the matrix.

The main motivation of this paper was to analyze the performance of spectral clustering. However, this is inextricably linked to the question of how to measure the quality of a clustering. The heuristics used in practice are usually defended empirically. Theoreticians, meanwhile, have been busy analyzing quality measures that are seductively simple to define (e.g. k-median, min sum, min diameter, etc.). Both approaches have obvious shortcomings: the justification provided by practitioners is typically case-by-case and experimental ("it works well on my data"); the measures thus far analyzed by theoreticians are easy to fool, i.e. there are simple examples where it is clear what the right clustering is, but optimizing the measures defined produces undesirable solutions (see section 3).

In this paper we propose a new bicriteria measure of the quality of a clustering, based on expansion-like properties of the underlying pairwise similarity graph. Roughly speaking, the quality of a clustering is given by two parameters: α, the minimum conductance of the clusters, and ε, the ratio of the weight of inter-cluster edges to the total weight of all edges. The objective is to find an (α, ε)-clustering that maximizes α and minimizes ε. Hence, imposing a lower bound on α, we strive to minimize ε, or vice versa. In section 3, we argue that although the new measure is more complex, it does not have the drawbacks of earlier quality measures. In section 4 we study an algorithm for finding nearly optimal solutions under the new measure and prove that it yields simultaneous poly-logarithmic approximations to both parameters. Then, in section 5 we turn our attention to analyzing spectral clustering under the new measure. We show that it too has worst-case guarantees, and also show that if the given data has a rather good clustering (i.e. α is large and ε is small), then the spectral algorithm will find a clustering close to the optimal one. Spectral clustering has another attractive feature: using the randomized algorithms of [7] and [5] it can be implemented extremely quickly (nearly linear time). We discuss this in section 6.

(* Supported in part by NSF grant CCR. † Supported by NSF CAREER award CCR. ‡ Supported in part by NSF CAREER award CCR and in part by a Wolfe fellowship.)

2 Spectral Clustering Algorithms

As stated, the primary motivation for this paper was to provide a theoretical analysis of spectral clustering. Spectral clustering refers to the general technique of partitioning the rows of a matrix according to their components in the top few singular vectors of the matrix. The underlying problem, that of clustering the rows of a matrix, is ubiquitous. We mention three special cases that are all of independent interest:

- The matrix encodes the pairwise similarities of the vertices of a graph.
- The rows of the matrix are points in a d-dimensional Euclidean space, and the columns are the coordinates.
- The rows of the matrix are the documents of a corpus and the columns are terms; the (i, j) entry encodes information about the occurrence of the jth term in the ith document.

Given a matrix A, the spectral algorithm for clustering the rows of A is given below.

SPECTRAL ALGORITHM I
Find the top k left singular vectors u_1, …, u_k of A, with singular values σ_1 ≥ ⋯ ≥ σ_k.
Let C be the matrix whose jth column is σ_j u_j.
Place row i in cluster j if C_ij is the largest entry in the ith row of C.

The algorithm has the following interpretation. Suppose the rows of A are points in a high-dimensional space. Then we first find the rank-k subspace that best approximates A, and project all the points onto this subspace. This subspace is spanned by the top k right singular vectors of A. Each singular vector represents a cluster; to get a clustering we map each projected point to the (cluster defined by the) singular vector that is closest to it in angle. In section 5, we describe a recursive variant of this algorithm.
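To make the procedure concrete, here is a minimal sketch of Spectral Algorithm I in Python/NumPy. This is our own illustration, not code from the paper; the function name and the toy data are invented, and numpy.linalg.svd stands in for whatever SVD routine one prefers.

```python
import numpy as np

def spectral_cluster_rows(A, k):
    """Sketch of Spectral Algorithm I: project the rows of A onto the
    top-k singular subspace and assign each row to the singular
    direction on which it has the largest component."""
    # A = U diag(s) V^T, singular values in s (decreasing).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # C has jth column s_j * u_j, i.e. C = A V_k: the rows of A
    # expressed in the top-k right singular directions.
    C = U[:, :k] * s[:k]
    # Row i goes to the cluster j maximizing C_ij.
    return np.argmax(C, axis=1)

# Toy usage: two noisy groups of similar rows.
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(1.0, 0.1, (5, 4)),
               rng.normal(-1.0, 0.1, (5, 4))])
print(spectral_cluster_rows(A, 2))
```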

3 What is a good clustering?

How good is the spectral algorithm? Intuitively, the algorithm performs well if points that are similar are assigned to the same cluster and points that are different are assigned to different clusters. Of course, this may not be possible for every pair of points, and so we compare the clustering found by the algorithm to the optimal one for the given matrix. This leads to the question of what exactly is an optimal clustering. To provide a quantitative answer, we first need to define a measure of the quality of a clustering. In recent years several combinatorial measures of clustering quality have been investigated in detail. These include minimum diameter, k-center, k-median, and minimum sum (for example: [3], [4], [6], [8], [9], etc.).

All these measures, although mathematically attractive due to their simple descriptions, are easy to fool. That is, one can construct examples with the property that the best clustering is obvious, yet an algorithm that optimizes one of these measures finds a clustering that is substantially different and therefore unsatisfactory. Such examples for the diameter and k-median measures are presented in Figs. 1 and 2. In the figures, nearby points represent highly similar points, and points that are farther apart are less similar.

Figure 1. Optimizing the diameter produces B while A is clearly more desirable.

Figure 2. The inferior clustering B is found by optimizing the k-median or min sum measure.

Consider a clustering that minimizes the maximum diameter of the clusters, the diameter of a cluster being the largest distance (say) between two points in the cluster. It is NP-hard to find such a clustering, but this is not our main concern. What is worrisome about the example shown in Fig. 1 is that the optimal solution (B) produces a cluster which contains points that should have been separated. The reason such a poor cluster is produced is that, although we have minimized the maximum dissimilarity between points in a cluster, this was at the expense of creating a cluster with many dissimilar points. The clustering (A), on the other hand, although it leads to a larger maximum diameter, is desirable since it better satisfies the goal of keeping similar points together and dissimilar points apart. This problem also arises for the k-median and minimum sum measures (see, for example, the case shown in Fig. 2); they may produce clusters of poor quality.

For the purpose of motivating our proposed measures, let us model the clustering problem via an edge-weighted complete graph whose vertices need to be partitioned. The weight a_ij of an edge represents the similarity of the vertices (points) i and j. Thus, the graph for points in space would have high edge weights for points that are close together and low edge weights for points that are far apart. Note that associated with the graph is an n × n symmetric matrix A with entries a_ij; here we assume that the a_ij are nonnegative.

A cluster will be a subgraph. The quality of a cluster should be determined by how similar the points within it are. In particular, if there is a cut of small weight that divides the cluster into two pieces of comparable size, then the cluster has lots of pairs of vertices that are dissimilar and hence it is of low quality. This might suggest that the quality of a subgraph as a cluster is the minimum cut of the subgraph. However, this is misleading, as illustrated in Fig. 3.

Figure 3. The second subgraph is of higher quality as a cluster although it has a smaller mincut.

In this example edges represent high-similarity pairs and non-edges represent pairs that are highly dissimilar. The minimum cut of the first subgraph is larger than that of the second subgraph. This is because the second subgraph has low-degree vertices. Nevertheless, the second subgraph is a higher quality cluster. This can be attributed to the fact that in the first subgraph there is a cut whose weight is small relative to the sizes of the pieces it creates.

A quantity that measures the relative cut size is the expansion. The expansion of a cut is the ratio of the total weight of the cut edges to the number of vertices in the smaller part created by the cut. Formally, we denote the expansion of a cut (S, V∖S) by

    ψ(S) = w(S, V∖S) / min(|S|, |V∖S|),   where w(S, V∖S) = Σ_{i∈S, j∉S} a_ij.

We say that the expansion of a graph is the minimum expansion over all the cuts of the graph. Our first measure of the quality of a cluster is the expansion of the subgraph corresponding to it. The expansion of a clustering is the minimum expansion of one of its clusters.

The measure defined above gives equal importance to all the vertices of the given graph. This, though, may lead to a rather taxing requirement. For example, in order to accommodate a vertex i with very little similarity to all the other vertices combined (i.e., Σ_j a_ij is small), the expansion will have to be very low. Arguably it is more prudent to give greater importance to vertices that have many similar neighbours and lesser importance to vertices that have few similar neighbours. This can be done by a direct generalization of the expansion, called the conductance, in which subsets of vertices are weighted to reflect their importance. The conductance of a cut (S, V∖S) in G is denoted by

    φ(S) = w(S, V∖S) / min(a(S), a(V∖S)),

where a(S) = a(S, V) = Σ_{i∈S} Σ_{j∈V} a_ij. The conductance of a graph is the minimum conductance over all the cuts of the graph: φ(G) = min_{S⊆V} φ(S). In order to quantify the quality of a clustering we generalise the definition of conductance further. Take a cluster C ⊆ V and a cut (S, C∖S) within C, where S ⊆ C. Then we say that the conductance of S in C is

    φ(S, C) = w(S, C∖S) / min(a(S), a(C∖S)).

The conductance of a cluster, φ(C), will then be the smallest conductance of a cut within the cluster. The conductance of a clustering is the minimum conductance of its clusters. This leads to the following optimization problem: given a graph and an integer k, find a k-clustering of maximum conductance. Notice that optimizing the expansion/conductance gives the right clustering in the examples of Figs. 1 and 2.

There is still a problem with the above clustering measure. The graph might consist mostly of clusters of high quality and a few points that create clusters of very poor quality, so that any clustering necessarily has a poor overall quality (since we have defined the quality of a clustering to be the minimum over all its clusters).
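The quantities defined above are straightforward to compute for a cut given as a set of vertex indices. The following Python sketch (our own; all names are assumptions) follows the definitions literally, with a(S) measured with respect to the whole graph:

```python
import numpy as np

def cut_weight(A, S):
    """w(S, V \\ S): total weight of edges crossing the cut."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[np.asarray(S)] = True
    return A[mask][:, ~mask].sum()

def expansion(A, S):
    """psi(S) = w(S, V \\ S) / min(|S|, |V \\ S|)."""
    n = A.shape[0]
    return cut_weight(A, S) / min(len(S), n - len(S))

def conductance(A, S):
    """phi(S) = w(S, V \\ S) / min(a(S), a(V \\ S)), where
    a(S) = sum of all edge weights incident to S in the whole graph."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[np.asarray(S)] = True
    a_S, a_rest = A[mask].sum(), A[~mask].sum()
    return cut_weight(A, S) / min(a_S, a_rest)
```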

In fact, to boost the overall quality, the best clustering might create many clusters of relatively low quality so that the minimum is as large as possible. Such an example is shown in Fig. 4.

Figure 4. Assigning the outliers leads to poor quality clusters.

One way to handle the problem might be to avoid restricting the number of clusters. But this could lead to a situation where many points are in singleton clusters. Instead we measure the quality of a clustering using two criteria: the first is the minimum quality of the clusters (called α), and the second is the fraction of the total weight of edges that are not covered by the clusters (called ε).

Definition 1. We call a partition {C_1, C_2, …, C_l} of V an (α, ε)-partition if:
1. the conductance of each C_i is at least α;
2. the total weight of inter-cluster edges is at most an ε fraction of the total edge weight.

The associated bicriteria optimization problem is the following.

Problem 1. Find an (α, ε)-partition that simultaneously maximizes α and minimizes ε.

Another way to pose the problem is: given α, find an (α, ε)-partition that minimizes ε (or vice versa). Note that the number of clusters is not restricted. For this measure of cluster quality, we are not aware of any bad examples for the (α, ε)-clustering problem, i.e. examples where solving this bicriteria optimization problem gives a clustering that is clearly inferior to the best clustering. An important question is whether this observation holds up under a systematic experimental study. The focus of the rest of this paper, however, is to consider the measure from a theoretical standpoint and to examine in detail the performance of spectral clustering algorithms.

4 An approximation algorithm

Observe that Problem 1 is NP-hard: consider maximising α whilst setting ε to zero; this problem is equivalent to finding the conductance of a given graph. As a result, it is appropriate to develop approximation guarantees for the problem. Hence, here we present a simple heuristic, outlined below, and show that it is a poly-logarithmic approximation algorithm for our clustering problem.

APPROXIMATE-CUT ALGORITHM
Find an approximate sparsest cut in G.
If the conductance of the cut is at least α*, then output G as a cluster.
Else recurse on the two induced pieces.

The idea behind our algorithm is simple. Given G, find a cut (S, V∖S) of minimum expansion (or conductance, depending on the quality measure). If the cut has quality, say, at least α* then stop. Otherwise, recurse on the subgraphs induced by S and V∖S. Clearly the algorithm produces clusters with quality at least α*. Since finding a cut of minimum conductance is hard, we cannot implement this algorithm in polynomial time in its current form. However, a simple modification allows for a polynomial implementation. Leighton and Rao [10] gave a polynomial-time algorithm that finds a cut with conductance at most O(log n) times the minimum. So, instead of finding a cut of minimum conductance at each step, we apply the Leighton-Rao procedure and find an approximate sparsest cut. This modified algorithm, therefore, produces clusters with conductance at least α*/O(log n).
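The recursion itself is a few lines; the heavy lifting is in the sparsest-cut subroutine. A Python skeleton under our assumptions (the subroutine is passed in as a black box standing in for the Leighton-Rao procedure; its interface is invented):

```python
def approximate_cut(G, alpha_star, find_approx_sparsest_cut):
    """Recursive skeleton of the Approximate-Cut algorithm. G is a
    collection of vertices; find_approx_sparsest_cut(G) is assumed to
    return (S, T, phi): a cut of G together with its conductance phi."""
    if len(G) <= 1:
        return [G]                  # nothing left to cut
    S, T, phi = find_approx_sparsest_cut(G)
    if phi >= alpha_star:
        return [G]                  # G is output as a finished cluster
    # Otherwise recurse on the two induced pieces.
    return (approximate_cut(S, alpha_star, find_approx_sparsest_cut) +
            approximate_cut(T, alpha_star, find_approx_sparsest_cut))
```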
Suppose now that the graph contains an (α, ε)-partition. We obtain poly-logarithmic guarantees for our problem. First we consider the situation in which the quality of a cluster is given via the expansion measure. Implementing the algorithm with an appropriate threshold α*, we obtain the following theorem.

Theorem 1. If there is an (α, ε)-partition, where the quality of a cluster is measured with respect to expansion, then the Approximate-Cut algorithm will find an (α/O(log n), O(ε log² n))-partition.

Proof: At any point in time, the algorithm induces a partition of the vertex set V of the graph into V_1, V_2, …, V_s, where V = V_1 ∪ ⋯ ∪ V_s. Clearly, upon termination, each cluster has the desired expansion. Let us call the cuts that the algorithm partitions along (S_1, T_1), (S_2, T_2), …, where |S_i| ≤ |T_i|. Note that S_i ∪ T_i is just the subset of V being partitioned at the ith step. Let {C_1, …, C_l} be the optimal (α, ε)-clustering. Observe that a(V) is twice the total edge weight W in the graph, and so εW is the weight of the inter-cluster edges in the optimal solution. Now we divide the cuts into two groups: the first group H consists of the ones with high expansion within clusters, and the second group consists of the remaining cuts. The precise definitions follow.

We write w(S_i, T_i) = Σ_{u∈S_i, v∈T_i} a_uv for the weight of the ith cut, and we denote by w_I(S_i, T_i) the sum of the weights of the intra-cluster edges of the cut, i.e. w_I(S_i, T_i) = Σ_{j≤l} w(S_i ∩ C_j, T_i ∩ C_j). Let

    H = { i : w_I(S_i, T_i) ≤ (α / (2 log n)) · |S_i| }.

We first deal with the i ∈ H. Since each optimal cluster C_j has expansion at least α, we have w(S_i ∩ C_j, T_i ∩ C_j) ≥ α · min(|S_i ∩ C_j|, |T_i ∩ C_j|), and so, for i ∈ H,

    Σ_{j≤l} min(|S_i ∩ C_j|, |T_i ∩ C_j|) ≤ w_I(S_i, T_i)/α ≤ |S_i|/2.    (1)

For each i ∈ H, we define another cut (S'_i, T'_i) of S_i ∪ T_i in the following manner: for each j ≤ l, if |S_i ∩ C_j| ≥ |T_i ∩ C_j|, then put all of (S_i ∪ T_i) ∩ C_j into S'_i; otherwise, put all of (S_i ∪ T_i) ∩ C_j into T'_i. It follows from (1) that min(|S'_i|, |T'_i|) ≥ |S_i|/2. Now, using the fact that (S_i, T_i) is an approximately sparsest cut in S_i ∪ T_i, we observe that

    w(S_i, T_i)/|S_i| ≤ O(log n) · w(S'_i, T'_i)/min(|S'_i|, |T'_i|),

and hence w(S_i, T_i) ≤ O(log n) · w(S'_i, T'_i). For any subset S of V, let ∂(S) denote the set of inter-cluster edges incident to a vertex in S. Then w(S'_i, T'_i) ≤ w(∂(S'_i)), since every edge in the cut (S'_i, T'_i) is an inter-cluster edge. So we have

    w(S_i, T_i) ≤ O(log n) · w(∂(S'_i)).

Claim 1. For each vertex u ∈ V, there are at most 2 log n values of i such that u belongs to S'_i.

Proof of Claim: Fix u ∈ V, and let I = { i : u ∈ S'_i ∩ S_i } and J = { i : u ∈ S'_i ∖ S_i }. Clearly |I| ≤ log n: if u ∈ S_i ∩ S_k with i < k, then (S_k, T_k) partitions a subset of S_i, and so |S_k| ≤ |S_i|/2; that is, the size of the S-set containing u at least halves between two successive indices in I. Now suppose i, k ∈ J with i < k. Then u ∈ T_i, and later T_i (or a subset of T_i) is partitioned into (S_k, T_k) with u ∈ T_k; using the construction of the cuts (S'_i, T'_i), one checks that |T_k| ≤ |T_i|/2. Thus the size of the T-set containing u halves between two successive indices in J, and so |J| ≤ log n as well. ∎

By Claim 1, each inter-cluster edge is incident to S'_i for at most 4 log n values of i, and the total weight of the inter-cluster edges is εW. Thus we have

    Σ_{i∈H} w(S_i, T_i) ≤ O(log n) · Σ_{i∈H} w(∂(S'_i)) = O(εW log² n).    (2)

Now we deal with the i not in H. First, suppose that all the cuts together induce a partition of C_j into P_{j,1}, P_{j,2}, …, P_{j,r_j}; we order them so that P_{j,1} is the largest. Every edge between two vertices in C_j which belong to different sets of the partition must be cut by some cut (S_i, T_i), and conversely, every intra-cluster edge of every cut (S_i, T_i) must have its two endpoints in different sets of some such partition. So, given that C_j has expansion at least α,

    Σ_{all i} w_I(S_i, T_i) ≥ (α/2) · Σ_j Σ_{s=2}^{r_j} min(|P_{j,s}|, |C_j ∖ P_{j,s}|).

For each vertex u in C_j, there can be at most log n values of i such that u belongs to the smaller of the two sets S_i ∩ C_j and T_i ∩ C_j; furthermore, if u ∈ P_{j,1}, then u never belongs to the smaller of the two sets. So we have

    Σ_i Σ_j min(|S_i ∩ C_j|, |T_i ∩ C_j|) ≤ log n · Σ_j Σ_{s=2}^{r_j} |P_{j,s}|.

Combining the last two inequalities with the definition of H, and with the fact that every cut made by the algorithm has small expansion, we obtain

    Σ_{i∉H} w(S_i, T_i) = O(εW log n).    (3)

Also, we have

    Σ_i ( w(S_i, T_i) − w_I(S_i, T_i) ) ≤ εW,    (4)

since each inter-cluster edge appears in at most one cut. Adding (2), (3) and (4), we get the theorem. ∎

Consider now the general case, in which the quality of a cluster is measured with respect to conductance.

Observe that the function a(·) is an additive function on the subsets of V: for disjoint U_1 and U_2 we have a(U_1 ∪ U_2) = a(U_1) + a(U_2), which is akin to the cardinality function. With this observation, we may prove, using methods similar to those of Theorem 1, the following guarantee for the general case.

Theorem 2. If there exists an (α, ε)-partition, where the quality of a cluster is measured in terms of conductance, then the Approximate-Cut algorithm will produce an (α/O(log(n/ε)), O(ε log²(n/ε)))-partition.

We now assess the running times of our algorithms. These are based upon implementing an approximate sparsest cut procedure. The fastest implementation of this procedure, due to Benczur and Karger [2], runs in Õ(n²) time. In addition, note that the algorithms make fewer than n cuts. Hence our heuristics have total running time Õ(n³). This, though, may be too slow for many real-world applications. We discuss a faster, more practical, algorithm in the next section.

5 Performance guarantees for spectral clustering

In this section we describe and analyse a one-at-a-time variant of the spectral algorithm. This algorithm, outlined below, has been used in the field of computer vision [12] and also in the field of web search engines [16]. Observe that the algorithm is similar in flavour to the Approximate-Cut algorithm described in the previous section.

SPECTRAL ALGORITHM II
Normalise A and find its 2nd right eigenvector v.
Find the best ratio cut with respect to v.
If the conductance of the cut is at least α*, then output the cluster.
Else recurse on the induced pieces.

We now elaborate upon this basic description of the algorithm. Initially we normalise our matrix A so that the row sums are equal to one. At any stage in the algorithm we have a clustering {C_1, …, C_s}. For each C_t we consider the |C_t| × |C_t| submatrix B of A restricted to C_t. We renormalise B by setting b_ii = 1 − Σ_{j∈C_t, j≠i} b_ij; hence B is also nonnegative with row sums equal to one. We then find the second eigenvector of B. This is the right eigenvector v corresponding to the second largest eigenvalue λ_2, i.e. Bv = λ_2 v. We then order the elements (rows) of C_t decreasingly with respect to their component in the direction of v. Given this ordering, say u_1, u_2, …, u_r, find the minimum ratio cut in C_t: this is the prefix cut ({u_1, …, u_j}, C_t ∖ {u_1, …, u_j}), 1 ≤ j < r, of minimum conductance. If this cut has conductance at least α*, then we output the cluster C_t; otherwise, we recurse on the pieces {u_1, …, u_j} and C_t ∖ {u_1, …, u_j}.

For simplicity of analysis, we consider a slightly modified algorithm. If at any point we have a cluster C_t with the property that a(C_t) ≤ (ε/n)·a(V), then we split C_t into singletons and output these vertices. Notice that this latter operation may be performed at small cost, since such singletons are of little importance. Observe, also, that our algorithm performs at least as well as this modified version.

We are now ready to analyse the performance of our spectral algorithms. We do this in two ways. First we give worst-case performance guarantees for the one-at-a-time variant. Then we show that both spectral algorithms have the property that, given the existence of a good clustering, the algorithm will find one close to it.
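A sketch of one recursive step of Spectral Algorithm II in Python/NumPy, under our assumptions (the paper gives no code; numpy.linalg.eig stands in for an eigensolver, the diagonal padding implements the renormalisation described above, and a(·) is measured with respect to the whole graph):

```python
import numpy as np

def best_ratio_cut(W, idx):
    """One step of Spectral Algorithm II (a sketch; names are ours).
    W is the raw symmetric nonnegative similarity matrix; idx lists
    the rows of the current cluster C_t."""
    idx = list(idx)
    d = W.sum(axis=1)                      # a(u) w.r.t. the whole graph
    sub = W[np.ix_(idx, idx)]
    B = sub / d[idx][:, None]              # rows of the normalised matrix
    # Pad the diagonal so B is nonnegative with row sums exactly 1.
    np.fill_diagonal(B, B.diagonal() + (1.0 - B.sum(axis=1)))
    vals, vecs = np.linalg.eig(B)
    v = vecs[:, np.argsort(-vals.real)[1]].real   # 2nd right eigenvector
    perm = np.argsort(-v)                  # rows by decreasing v-component
    a = d[idx][perm]
    best_j, best_phi = 1, float("inf")
    for j in range(1, len(idx)):           # sweep all prefix cuts
        w_cut = sub[np.ix_(perm[:j], perm[j:])].sum()
        phi = w_cut / min(a[:j].sum(), a[j:].sum())
        if phi < best_phi:
            best_j, best_phi = j, phi
    S = [idx[p] for p in perm[:best_j]]
    T = [idx[p] for p in perm[best_j:]]
    return S, T, best_phi
```

If best_phi is at least the threshold, the caller outputs idx as a cluster; otherwise it recurses on S and T.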
5.1 Worst-Case Guarantees

We will require the following theorem. This result was essentially proved by Sinclair and Jerrum (in their proof of Lemma 3.3 in [13], although it is not mentioned in the statement of the lemma). For completeness, and because Theorem 3 is usually not explicitly stated in the Markov chain literature (and usually includes some other conditions which are not relevant here), we include a proof of this result in the appendix.

Theorem 3. Suppose B is an N × N matrix with nonnegative entries with each row sum equal to 1, and suppose there are positive real numbers π_1, …, π_N summing to 1 such that π_i b_ij = π_j b_ji for all i, j. If v is the right eigenvector of B corresponding to the second largest eigenvalue λ_2, and i_1, i_2, …, i_N is an ordering of 1, 2, …, N so that v_{i_1} ≥ v_{i_2} ≥ ⋯ ≥ v_{i_N}, then

    min_{1≤l<N} ( Σ_{u≤l<t} π_{i_u} b_{i_u i_t} ) / min( Σ_{u≤l} π_{i_u}, Σ_{t>l} π_{i_t} )
        ≤ 2 √( min_{S⊆{1,…,N}} ( Σ_{i∈S, j∉S} π_i b_ij ) / min( π(S), 1 − π(S) ) ).

In words: the best prefix (contiguous) cut in the second-eigenvector ordering has conductance at most 2√φ, where φ is the minimum conductance over all cuts.

Now, implementing the algorithm with an appropriate threshold α*, we may obtain worst-case guarantees in a similar manner to those given by Theorems 1 and 2.

Theorem 4. If A has an (α, ε)-partition, then Spectral Algorithm II will find an (α²/O(log(n/ε)), O(√ε · log(n/ε)))-partition.

Proof: Let us call the cuts the algorithm partitions along (S_1, T_1), (S_2, T_2), …, where we adopt the convention that S_i is the smaller side, i.e., a(S_i) ≤ a(T_i).

Let {C_1, …, C_l} be the optimal (α, ε)-clustering. As in the proof of Theorem 1, let w_I(S_i, T_i) = Σ_{j≤l} w(S_i ∩ C_j, T_i ∩ C_j) denote the intra-cluster weight of a cut, and divide the cuts into two groups, the first group being

    H = { i : w_I(S_i, T_i) ≤ (α / (2 log(n/ε))) · a(S_i) },

and the second group the rest. We first deal with the i ∈ H. For all i ∈ H we have, since each C_j has conductance at least α,

    Σ_{j≤l} min( a(S_i ∩ C_j), a(T_i ∩ C_j) ) ≤ w_I(S_i, T_i)/α ≤ a(S_i) / (2 log(n/ε)).

For each i ∈ H, we define another cut (S'_i, T'_i) of S_i ∪ T_i in the following manner: for each j ≤ l, if a(S_i ∩ C_j) ≥ a(T_i ∩ C_j), then put all of (S_i ∪ T_i) ∩ C_j into S'_i; otherwise, put all of (S_i ∪ T_i) ∩ C_j into T'_i. Notice that then min(a(S'_i), a(T'_i)) ≥ a(S_i)/2.

Now we use Theorem 3 to get an upper bound on w(S_i, T_i) in terms of w(S'_i, T'_i). To this end, first note that if we let π_u = a(u)/Σ_{v∈C_t} a(v) for all u ∈ C_t, then π_u b_uv = π_v b_vu for all u, v ∈ C_t for the matrix B in the algorithm. (This follows from the symmetry of A.) Note also that the algorithm orders the elements of C_t according to their components in the second right eigenvector of B and takes (S_i, T_i) to be the contiguous cut of minimum conductance in this order. So, applying Theorem 3, we obtain

    w(S_i, T_i) / min( a(S_i), a(T_i) ) ≤ 2 √( w(S'_i, T'_i) / min( a(S'_i), a(T'_i) ) ).    (5)

For any subset S of V, let ∂(S) denote the set of inter-cluster edges incident to a vertex in S. Then w(S'_i, T'_i) ≤ w(∂(S'_i)), since every edge in the cut (S'_i, T'_i) is an inter-cluster edge. So we have, from (5),

    w(S_i, T_i) ≤ 2√2 · √( w(∂(S'_i)) · a(S_i) ).    (6)

Claim 2. For each vertex u ∈ V, there are at most log(n/ε) values of i such that u belongs to S_i. Further, there are at most 2 log(n/ε) values of i such that u belongs to S'_i.

Proof of Claim: Fix u ∈ V. If u ∈ S_i ∩ S_k with i < k, then (S_k, T_k) must be a partition of S_i or of a subset of S_i, and we have a(S_k) ≤ a(S_k ∪ T_k)/2 ≤ a(S_i)/2. So a(S_i) reduces by a factor of 2 or greater between two successive times that u belongs to S_i. The maximum value of a(S_i) is at most a(V) and the minimum value is at least (ε/n)·a(V), so the first statement of the claim follows. Now let J = { i : u ∈ S'_i ∖ S_i } and suppose i, k ∈ J with i < k. Then u ∈ T_i, and later T_i (or a subset of T_i) is partitioned into (S_k, T_k); using the construction of the cuts (S'_i, T'_i), one checks that a(T_k) ≤ a(T_i)/2. Thus a(T_i) halves between two successive times in J, and so |J| ≤ log(n/ε). This proves the second statement, since u ∈ S'_i implies that u ∈ S_i or u ∈ S'_i ∖ S_i. ∎

Thus, using the Cauchy-Schwarz inequality together with Claim 2 (which gives Σ_i w(∂(S'_i)) ≤ 4 log(n/ε) · εW and Σ_i a(S_i) ≤ log(n/ε) · a(V)), we have

    Σ_{i∈H} w(S_i, T_i) ≤ 2√2 · Σ_{i∈H} √( w(∂(S'_i)) · a(S_i) )    (7)
        ≤ 2√2 · √( Σ_i w(∂(S'_i)) ) · √( Σ_i a(S_i) ) = O( √ε · log(n/ε) ) · a(V).    (8)

Now, we deal with the i not in H. First, suppose that all the cuts together induce a partition of C_j into P_{j,1}, P_{j,2}, …, P_{j,r_j}, ordered so that a(P_{j,1}) is largest. Every edge between two vertices in C_j which belong to different sets of the partition must be cut by some cut (S_i, T_i), and conversely, every intra-cluster edge of every cut (S_i, T_i) must have its two endpoints in different sets of such a partition. So, given that C_j has conductance at least α, we obtain

    Σ_{all i} w_I(S_i, T_i) ≥ (α/2) · Σ_j Σ_{s=2}^{r_j} min( a(P_{j,s}), a(C_j ∖ P_{j,s}) ).

For each vertex u there can be at most log(n/ε) values of i such that u belongs to the smaller (according to a(·)) of the two sets S_i and T_i. So, combining the inequality above with the definition of H and with the fact that every cut made by the algorithm has conductance below the threshold, we obtain

    Σ_{i∉H} w(S_i, T_i) = O(√ε · log(n/ε)) · a(V).    (9)

Also, we have

    Σ_i ( w(S_i, T_i) − w_I(S_i, T_i) ) ≤ ε · a(V),    (10)

since each inter-cluster edge belongs to at most one cut (S_i, T_i). We add up (8), (9) and (10). Also note that splitting up all the C_t with a(C_t) ≤ (ε/n)·a(V) into singletons costs us at most ε·a(V) on the whole. Recalling that a(V) equals twice the total sum of edge weights, the bound regarding the cost of the inter-cluster edge weights follows. Finally, note that at the end each part has conductance at least α²/O(log(n/ε)), where this is trivially true of singleton sets. The theorem follows. ∎

5.2 In the Presence of a Good Clustering

Suppose a given input matrix has a particularly good clustering, i.e. it can be partitioned into blocks such that the conductance of each block as a cluster is high and the total weight of inter-cluster edges is small. We model this situation as follows. We assume that A is normalized so that each row sum is 1. The matrix is written as A = B + E, where B is a block-diagonal matrix whose blocks B_1, …, B_k correspond to the clusters of the optimal clustering, and E captures the edges that run between clusters. Rather than conductance, it will be easier to state the theorem in terms of the minimum eigenvalue gap of the blocks. The eigenvalue gap of a block B_i is the difference between its two largest eigenvalues. The eigenvalue gap is closely related to the conductance (φ²/2 ≤ 1 − λ_2 ≤ 2φ). Ideally we would like to show that if the eigenvalue gap is large for each block of B and the total weight of the entries of E is small, then the spectral algorithm works. Theorem 5 shows that if the gaps are large and the 2-norm of E is small, then the spectral algorithm finds a nearly optimal clustering. The 2-norm of an n × m matrix M is

    ‖M‖_2 = max_{x ∈ R^m, x ≠ 0} ‖Mx‖ / ‖x‖.

We are now ready to state and prove the following theorem. Analyses in a similar spirit have been conducted by Papadimitriou et al. [11] and Fiat et al. [1].

Theorem 5. Let the eigenvalue gap of each cluster B_i of the optimal clustering be at least λ (0 < λ ≤ 1). In addition, let the difference between the kth and (k+1)th eigenvalues of B also be at least λ. Suppose E has 2-norm at most δ, where δ ≤ cλ for some constant c. Then, if the optimal cluster sizes are within a bounded ratio, the spectral algorithm applied to A finds a clustering that differs from the optimal one in O((δ/λ)²·n) rows.
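Before the proof, a small numerical illustration of this setup (entirely our own construction, not an experiment from the paper): perturb a block-diagonal row-stochastic matrix B by a matrix E of small 2-norm and check that the top-k invariant subspace barely moves.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 3                              # three equal blocks of 20 rows
B = np.zeros((n, n))
for b in range(k):                        # block-diagonal stochastic B
    s = slice(20 * b, 20 * (b + 1))
    blk = rng.random((20, 20)) + 0.5
    B[s, s] = blk / blk.sum(axis=1, keepdims=True)
E = rng.normal(0, 1, (n, n))
E *= 0.01 / np.linalg.norm(E, 2)          # force the 2-norm of E to 0.01
A = B + E
# Compare top-k subspaces of B and A via principal angles.
Ub = np.linalg.svd(B)[0][:, :k]
Ua = np.linalg.svd(A)[0][:, :k]
cosines = np.linalg.svd(Ub.T @ Ua)[1]     # cosines of principal angles
print("smallest cosine:", cosines.min())  # close to 1: subspaces align
```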

Proof: The matrix A can be viewed as a perturbation of B by the matrix E. Let the first k eigenvectors of B form the columns of F_1, and the last n − k the columns of F_2, and let Λ_1 = F_1ᵀ B F_1 and Λ_2 = F_2ᵀ B F_2. Then F_2ᵀ B F_1 = 0, and Λ_1 is the diagonal matrix of the top k eigenvalues of B. A theorem of Stewart [15] (Theorem 4.11, page 745) tells us that, for some matrix P with 2-norm at most O(δ/λ), the columns of (F_1 + F_2 P)(I + PᵀP)^(−1/2) form an orthonormal basis for a subspace that is invariant for A. Hence we may take this basis, call it F̂_1, to be an orthonormal basis for a subspace invariant for A, and it follows that F̂_1 = F_1 + G for a matrix G with 2-norm O(δ/λ).

Recall that we partition our rows with respect to the n × k matrix F̂_1. Hence, we are interested in the difference between the clusterings induced by F_1 and F̂_1. From the previous paragraph, we have that F_1 − F̂_1 has 2-norm O(δ/λ). By a suitable rotation, we can take F_1 to be the matrix whose ith column has its support on the rows of B_i. Further, the entries of the ith column are all close to 1/√(n_i), where n_i is the number of rows in the ith block. Now consider a row that is misclassified by F̂_1, and suppose it belongs to block i in the optimal clustering. Then, in the matrix F̂_1, in some column l other than i this row has an entry higher than its entry in column i. This implies a difference of at least 1/(2√(n_i)) between F_1 and F̂_1 in that row, in either column i or column l. By our assumption that the block sizes are roughly equal, and by the bound of O(δ/λ) on the 2-norm of the difference between F_1 and F̂_1, we have that in total at most O((δ/λ)²·n) rows are misclassified, i.e. the clustering produced differs from the optimal one in O((δ/λ)²·n) rows. ∎

Theorem 5 can be generalized to the case of nonsymmetric matrices, using a corresponding theorem of Stewart for the perturbation of left and right singular subspaces.

6 Conclusion

There are two aspects to analyzing a clustering algorithm: (i) quality: how good is the clustering produced? and (ii) speed: how fast can it be found? In this paper we have mostly dealt with (i), while taking care that the algorithms run in polynomial time. The spectral algorithms depend on the time it takes to find the top (or top k) singular vector(s). While this can be done exactly in polynomial time, it is still too expensive for applications such as information retrieval. The recent work of [7] and [5] on randomized algorithms for low-rank approximation tackles this problem. The running time of the first algorithm depends only on the quality of the desired approximation and not on the size of the matrix, but it assumes that the entries of the matrix can be sampled in a specific manner. The second algorithm needs no assumptions and has a running time that is linear in the number of non-zero entries.

References

[1] Y. Azar, A. Fiat, A. Karlin, F. McSherry and J. Saia. Spectral analysis of data. Manuscript.
[2] A. Benczur and D. Karger. Approximate s-t min-cuts in Õ(n²) time. In Proc. of 28th STOC, pp. 47-55, 1996.
[3] M. Charikar, C. Chekuri, T. Feder and R. Motwani. Incremental clustering and dynamic information retrieval. In Proc. of 29th STOC, 1997.
[4] M. Charikar, S. Guha, D. Shmoys and E. Tardos. A constant-factor approximation for the k-median problem. In Proc. of 31st STOC, pp. 1-10, 1999.
[5] P. Drineas, A. Frieze, R. Kannan, S. Vempala and V. Vinay. Clustering in large graphs and matrices. In Proc. of 10th SODA, 1999.
[6] M.E. Dyer and A. Frieze. A simple heuristic for the p-center problem. Oper. Res. Lett., 3, pp. 285-288, 1985.
[7] A. Frieze, R. Kannan and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proc. of 39th FOCS, 1998.
[8] P. Indyk. A sublinear time approximation scheme for clustering in metric spaces. In Proc. of 40th FOCS, pp. 154-159, 1999.
[9] K. Jain and V. Vazirani. Primal-dual approximation algorithms for metric facility location and k-median problems. In Proc. of 40th FOCS, pp. 2-13, 1999.
[10] T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In Proc. of 28th FOCS, pp. 256-269.
[11] C. Papadimitriou, P. Raghavan, H. Tamaki and S. Vempala. Latent semantic indexing: a probabilistic analysis. In Proc. of 17th Symposium on Principles of Database Systems, 1998.
[12] J. Shi and J. Malik. Normalized cuts and image segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition, 1997.
[13] A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82, pp. 93-133, 1989.
[14] D. Spielman and S. Teng. Spectral partitioning works: planar graphs and finite element meshes. In Proc. of 37th FOCS, 1996.
[15] G. W. Stewart. Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4), 1973.
[16]

A Proof of the Conductance Inequality

Theorem 6. Suppose B is an N × N matrix with nonnegative entries with each row sum equal to 1, and suppose there are positive real numbers π_1, …, π_N summing to 1 such that π_i b_ij = π_j b_ji for all i, j. If v is the right eigenvector of B corresponding to the second largest eigenvalue λ_2, and i_1, i_2, …, i_N is an ordering of 1, 2, …, N so that v_{i_1} ≥ v_{i_2} ≥ ⋯ ≥ v_{i_N}, then

    min_{1≤l<N} ( Σ_{u≤l<t} π_{i_u} b_{i_u i_t} ) / min( Σ_{u≤l} π_{i_u}, Σ_{t>l} π_{i_t} )
        ≤ 2 √( min_{S⊆{1,…,N}} ( Σ_{i∈S, j∉S} π_i b_ij ) / min( π(S), 1 − π(S) ) ).

Proof: Let D = diag(√π_1, …, √π_N) and Q = D B D^(−1). Then q_ij = √(π_i/π_j)·b_ij = √(π_j/π_i)·b_ji = q_ji, by the condition π_i b_ij = π_j b_ji, and so Q is symmetric. The eigenvalues of Q are the same as those of B, and the largest eigenvalue is 1. Also, (√π_1, …, √π_N) Q = (√π_1, …, √π_N), and so (√π_1, …, √π_N)ᵀ is the eigenvector of Q corresponding to the eigenvalue 1. So we have

    λ_2 = max_{x ⊥ (√π_1,…,√π_N)ᵀ} (xᵀ Q x) / (xᵀ x),   and thus   1 − λ_2 = min_{x ⊥ (√π_1,…,√π_N)ᵀ} ( xᵀ (I − Q) x ) / ( xᵀ x ).

Putting y = D^(−1) x, and using the row sums of B together with the symmetry condition, we get

    1 − λ_2 = min_{Σ_i π_i y_i = 0} ( Σ_{i<j} π_i b_ij (y_i − y_j)² ) / ( Σ_i π_i y_i² ).

If y = v attains the minimum above, then Dv is the eigenvector of Q corresponding to the eigenvalue λ_2, and so v is the right eigenvector of B corresponding to λ_2. Our ordering is according to v, as in the statement of the theorem; assume, for simplicity of notation, that the indices (of the rows and corresponding columns of B) have been reordered so that v_1 ≥ v_2 ≥ ⋯ ≥ v_N. Define r to satisfy Σ_{i<r} π_i ≤ 1/2 < Σ_{i≤r} π_i, and let z_i = v_i − v_r for 1 ≤ i ≤ N. Then z_1 ≥ z_2 ≥ ⋯ ≥ z_r = 0 ≥ z_{r+1} ≥ ⋯ ≥ z_N, and since Σ_i π_i v_i = 0 we have Σ_i π_i v_i² ≤ Σ_i π_i z_i², so that

    1 − λ_2 ≥ ( Σ_{i<j} π_i b_ij (z_i − z_j)² ) / ( Σ_i π_i z_i² ).

Now, by Cauchy-Schwarz,

    ( Σ_{i<j} π_i b_ij (z_i − z_j)² ) · ( Σ_{i<j} π_i b_ij (|z_i| + |z_j|)² ) ≥ ( Σ_{i<j} π_i b_ij |z_i − z_j| (|z_i| + |z_j|) )².    (11)

Observe that, for i < j, we have |z_i − z_j|·(|z_i| + |z_j|) ≥ Σ_{k=i}^{j−1} |z_k² − z_{k+1}²|: if z_i and z_j have the same sign then the left-hand side equals |z_i² − z_j²|, which telescopes exactly along the monotone stretch; otherwise the left-hand side equals (|z_i| + |z_j|)² ≥ z_i² + z_j², which (since z_r = 0) again equals the telescoping sum. Also,

    Σ_{i<j} π_i b_ij (|z_i| + |z_j|)² ≤ 2 Σ_{i<j} π_i b_ij (z_i² + z_j²) ≤ 2 Σ_i π_i z_i².

Now let S_k = {1, 2, …, k} and C_k = Σ_{i≤k<j} π_i b_ij, and let

    φ̂ = min_{1≤k<N} C_k / min( π(S_k), 1 − π(S_k) ).

Then, using the observation above,

    Σ_{i<j} π_i b_ij |z_i − z_j| (|z_i| + |z_j|) ≥ Σ_{k=1}^{N−1} C_k |z_k² − z_{k+1}²|
        ≥ φ̂ Σ_{k=1}^{N−1} min( π(S_k), 1 − π(S_k) ) |z_k² − z_{k+1}²| ≥ φ̂ Σ_i π_i z_i²,

where the last step is partial summation: for k < r the minimum is π(S_k) and the terms sum to Σ_{i<r} π_i z_i², and symmetrically for k ≥ r (recall z_r = 0). Combining this with (11) and the bound on the second factor,

    1 − λ_2 ≥ ( φ̂ Σ_i π_i z_i² )² / ( 2 (Σ_i π_i z_i²)² ) = φ̂² / 2.

Finally, for any cut (S, S̄), substituting the test vector y with y_i = 1 − π(S) for i ∈ S and y_i = −π(S) otherwise into the characterization of 1 − λ_2 above gives

    1 − λ_2 ≤ ( Σ_{i∈S, j∉S} π_i b_ij ) / ( π(S)(1 − π(S)) ) ≤ 2 φ(S).

Combining the two bounds on 1 − λ_2 yields φ̂ ≤ 2√(φ(S)) for every S, which is the theorem. ∎


More information

CS675: Convex and Combinatorial Optimization Spring 2018 The Simplex Algorithm. Instructor: Shaddin Dughmi

CS675: Convex and Combinatorial Optimization Spring 2018 The Simplex Algorithm. Instructor: Shaddin Dughmi CS675: Convex and Combinatorial Optimization Spring 2018 The Simplex Algorithm Instructor: Shaddin Dughmi Algorithms for Convex Optimization We will look at 2 algorithms in detail: Simplex and Ellipsoid.

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

Clustering Data: Does Theory Help?

Clustering Data: Does Theory Help? Clustering Data: Does Theory Help? Ravi Kannan December 10, 2013 Ravi Kannan Clustering Data: Does Theory Help? December 10, 2013 1 / 27 Clustering Given n objects - divide (partition) them into k clusters.

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Online Scheduling for Sorting Buffers

Online Scheduling for Sorting Buffers Online Scheduling for Sorting Buffers Harald Räcke ½, Christian Sohler ½, and Matthias Westermann ¾ ½ Heinz Nixdorf Institute and Department of Mathematics and Computer Science Paderborn University, D-33102

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht

More information

Coloring 3-Colorable Graphs

Coloring 3-Colorable Graphs Coloring -Colorable Graphs Charles Jin April, 015 1 Introduction Graph coloring in general is an etremely easy-to-understand yet powerful tool. It has wide-ranging applications from register allocation

More information

PCP and Hardness of Approximation

PCP and Hardness of Approximation PCP and Hardness of Approximation January 30, 2009 Our goal herein is to define and prove basic concepts regarding hardness of approximation. We will state but obviously not prove a PCP theorem as a starting

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Abstract We present two parameterized algorithms for the Minimum Fill-In problem, also known as Chordal

More information

Spectral Clustering X I AO ZE N G + E L HA M TA BA S SI CS E CL A S S P R ESENTATION MA RCH 1 6,

Spectral Clustering X I AO ZE N G + E L HA M TA BA S SI CS E CL A S S P R ESENTATION MA RCH 1 6, Spectral Clustering XIAO ZENG + ELHAM TABASSI CSE 902 CLASS PRESENTATION MARCH 16, 2017 1 Presentation based on 1. Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17.4

More information

Planar graphs, negative weight edges, shortest paths, and near linear time

Planar graphs, negative weight edges, shortest paths, and near linear time Planar graphs, negative weight edges, shortest paths, and near linear time Jittat Fakcharoenphol Satish Rao Ý Abstract In this paper, we present an Ç Ò ÐÓ Òµ time algorithm for finding shortest paths in

More information

REGULAR GRAPHS OF GIVEN GIRTH. Contents

REGULAR GRAPHS OF GIVEN GIRTH. Contents REGULAR GRAPHS OF GIVEN GIRTH BROOKE ULLERY Contents 1. Introduction This paper gives an introduction to the area of graph theory dealing with properties of regular graphs of given girth. A large portion

More information

Recitation 4: Elimination algorithm, reconstituted graph, triangulation

Recitation 4: Elimination algorithm, reconstituted graph, triangulation Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 4: Elimination algorithm, reconstituted graph, triangulation

More information

Exponentiated Gradient Algorithms for Large-margin Structured Classification

Exponentiated Gradient Algorithms for Large-margin Structured Classification Exponentiated Gradient Algorithms for Large-margin Structured Classification Peter L. Bartlett U.C.Berkeley bartlett@stat.berkeley.edu Ben Taskar Stanford University btaskar@cs.stanford.edu Michael Collins

More information

1. Lecture notes on bipartite matching

1. Lecture notes on bipartite matching Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans February 5, 2017 1. Lecture notes on bipartite matching Matching problems are among the fundamental problems in

More information

Expected Approximation Guarantees for the Demand Matching Problem

Expected Approximation Guarantees for the Demand Matching Problem Expected Approximation Guarantees for the Demand Matching Problem C. Boucher D. Loker September 2006 Abstract The objective of the demand matching problem is to obtain the subset M of edges which is feasible

More information

Expected Runtimes of Evolutionary Algorithms for the Eulerian Cycle Problem

Expected Runtimes of Evolutionary Algorithms for the Eulerian Cycle Problem Expected Runtimes of Evolutionary Algorithms for the Eulerian Cycle Problem Frank Neumann Institut für Informatik und Praktische Mathematik Christian-Albrechts-Universität zu Kiel 24098 Kiel, Germany fne@informatik.uni-kiel.de

More information

Combinatorial Problems on Strings with Applications to Protein Folding

Combinatorial Problems on Strings with Applications to Protein Folding Combinatorial Problems on Strings with Applications to Protein Folding Alantha Newman 1 and Matthias Ruhl 2 1 MIT Laboratory for Computer Science Cambridge, MA 02139 alantha@theory.lcs.mit.edu 2 IBM Almaden

More information

Information-Theoretic Private Information Retrieval: A Unified Construction (Extended Abstract)

Information-Theoretic Private Information Retrieval: A Unified Construction (Extended Abstract) Information-Theoretic Private Information Retrieval: A Unified Construction (Extended Abstract) Amos Beimel ½ and Yuval Ishai ¾ ¾ ½ Ben-Gurion University, Israel. beimel@cs.bgu.ac.il. DIMACS and AT&T Labs

More information

Lecture 5: Duality Theory

Lecture 5: Duality Theory Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane

More information

A Modality for Recursion

A Modality for Recursion A Modality for Recursion (Technical Report) March 31, 2001 Hiroshi Nakano Ryukoku University, Japan nakano@mathryukokuacjp Abstract We propose a modal logic that enables us to handle self-referential formulae,

More information

Approximation Algorithms for Item Pricing

Approximation Algorithms for Item Pricing Approximation Algorithms for Item Pricing Maria-Florina Balcan July 2005 CMU-CS-05-176 Avrim Blum School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 School of Computer Science,

More information

High Dimensional Indexing by Clustering

High Dimensional Indexing by Clustering Yufei Tao ITEE University of Queensland Recall that, our discussion so far has assumed that the dimensionality d is moderately high, such that it can be regarded as a constant. This means that d should

More information