
Chapter 9

A SURVEY OF CLUSTERING ALGORITHMS FOR GRAPH DATA

Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY
charu@us.ibm.com

Haixun Wang
Microsoft Research Asia
Beijing, China
haixunw@microsoft.com

Abstract: In this chapter, we will provide a survey of clustering algorithms for graph data. We will discuss the different categories of clustering algorithms and recent efforts to design clustering methods for various kinds of graph data. Clustering algorithms are typically of two types. The first type consists of node clustering algorithms, in which we attempt to determine dense regions of the graph based on edge behavior. The second type consists of structural clustering algorithms, in which we attempt to cluster different graphs based on their overall structural behavior. We will also discuss the applicability of these approaches to other kinds of data, such as semi-structured data, and the utility of graph mining algorithms for such representations.

Keywords: Graph Clustering, Dense Subgraph Discovery

1. Introduction

Graph mining has been a popular area of research in recent years because of numerous applications in computational biology, software bug localization, and computer networking. In addition, many new kinds of data, such as semi-structured data and XML [2], can typically be represented as graphs.

In particular, XML data is a popular representation of many different kinds of data sets. Since core graph-mining algorithms can be extended to this scenario, it follows that the extension of mining algorithms to graphs has tremendous applicability to a wide variety of data sets which are represented as semi-structured data. Many traditional algorithms, such as clustering, classification, and frequent-pattern mining, have been extended to the graph scenario. A detailed discussion of various kinds of graph mining algorithms may be found in [15].

In this chapter, we will study the clustering problem for the graph domain. The problem of clustering is defined as follows: for a given set of objects, we would like to divide it into groups of similar objects. The similarity between objects is typically defined with the use of a mathematical objective function. This problem is useful in a number of practical applications such as marketing, customer segmentation, and data summarization, and is extremely important in a number of data domains. A detailed description of clustering algorithms may be found in [24]. Clustering algorithms have significant applications in a variety of graph scenarios such as congestion detection, facility location, and XML data integration [28]. Graph clustering problems typically fall into two categories:

- Node Clustering Algorithms: Node-clustering algorithms are generalizations of multi-dimensional clustering algorithms, in which we use functions of the multi-dimensional data points in order to define the distances. In the case of graph clustering algorithms, we associate numerical values with the edges. These numerical values need not satisfy traditional properties of distance functions, such as the triangle inequality. We use these distance values in order to create clusters of nodes. We note that the numerical value associated with a given edge may be either a distance value or a similarity value; correspondingly, the objective function associated with the partitioning is either minimized or maximized, respectively. We note that the problem of minimizing the inter-cluster similarity for a fixed number of clusters essentially reduces to the problem of graph partitioning, or the minimum multi-way cut problem. This is also referred to as the problem of mining dense graphs and pseudo-cliques. Recently, the problem has also been studied in the database literature as that of quasi-clique determination. In this problem, we determine groups of nodes which are almost cliques. In other words, an edge exists between any pair of nodes in the set with high probability. A closely related problem is that of determining shingles [5, 22]. Shingles are defined as those sub-graphs which have a large number of common links. This is particularly useful for massive graphs which contain a large number of nodes. In such cases, a min-hash approach [5] can be used in order to summarize the structural behavior of the underlying graph.

- Graph Clustering Algorithms: In this case, we have a (possibly large) number of graphs which need to be clustered based on their underlying structural behavior. This problem is challenging because of the need to match the structures of the underlying graphs, and to use these structures for clustering purposes. Such algorithms are discussed both in the context of classical graph data sets and in the context of semi-structured data. In the case of semi-structured data, the problem arises in the context of a large number of documents which need to be clustered on the basis of the underlying structure and attributes. It has been shown in [2] that the use of the underlying document structure leads to significantly more effective algorithms.

This chapter is organized as follows. In the next section, we will discuss a variety of node clustering algorithms. Methods for clustering multiple graphs and XML records are discussed in section 3. Section 4 discusses numerous applications of graph clustering algorithms. Section 5 contains the conclusions and summary.

2. Node Clustering Algorithms

A number of algorithms for graph node clustering are discussed in [19], where the graph clustering problem is related to the minimum-cut and graph partitioning problems. In this case, it is assumed that the underlying graphs have weights on the edges, and it is desired to partition the graph in such a way as to minimize the weight of the edges across the partitions. In general, we would like to partition the graph into k groups of nodes. However, since the special case k = 2 is efficiently solvable, we will first provide a special discussion for this case. This version of the problem is polynomially solvable, since it is the mathematical dual of the maximum flow problem. It is also referred to as the minimum-cut problem.

2.1 The Minimum Cut Problem

The simplest case is the two-way minimum cut problem, in which we wish to partition the graph into two clusters, so as to minimize the weight of the edges across the partitions. This version of the problem is efficiently solvable, and can be resolved by use of the maximum flow problem [4]. The minimum-cut problem is defined as follows. Consider a graph G = (N, A) with node set N and edge set A. The node set N contains the source s and sink t. Each edge (i, j) ∈ A has a weight associated with it, denoted by u_ij. We note that the edges may be either undirected or directed, though the undirected case is often much more relevant for connectivity applications.

We would like to partition the node set N into two groups S and N − S. The set of edges such that one end lies in S and the other lies in N − S is denoted by C(S, N − S). We would like to choose the partition such that the sum of the weights in C(S, N − S) is minimized. In other words, we would like to minimize \sum_{(i,j) \in C(S, N-S)} u_{ij}. This is the unrestricted version of the minimum-cut problem. We will examine two variations of the minimum-cut problem:

- We wish to determine the global minimum cut, with no restrictions on the membership of nodes to the different partitions.

- We wish to determine the minimum s-t cut, in which one partition contains the source node s and the other partition contains the sink node t.

It is easy to see that the former problem can be solved by repeated applications of the latter: by fixing s and choosing different values of the sink t, it can be shown that the global minimum cut may be effectively determined.

It turns out that the maximum flow problem is the mathematical dual of the minimum cut problem. In the maximum flow problem, we assume that the weight u_ij is a capacity of the edge (i, j). Each edge is allowed to carry a flow x_ij which is at most equal to the capacity u_ij. Each node other than the source s and the sink t is assumed to satisfy the flow conservation property. In other words, for each such node i we have:

\sum_{j:(i,j) \in A} x_{ij} = \sum_{j:(j,i) \in A} x_{ji}        (9.1)

We would like to maximize the total flow originating from the source and reaching the sink t, subject to the above constraints. The maximum flow problem is solved with the use of a variety of augmenting-path and preflow-push algorithms [4]. In augmenting-path methods, we pick a path from s to t which has currently unused capacity, and increase the flow on this path such that at least one edge on the path is filled to capacity. We repeat this process until no path with unfilled capacity exists from the source s to the sink t. Many different variations of this technique exist in terms of the choice of the path used to augment the flow; examples include shortest-path and maximum-capacity augmenting paths. Different choices of augmenting paths will typically lead to different trade-offs in running time. These trade-offs are discussed in [4]. In general, the two-way cut problem can be solved quite efficiently in polynomial time with these different methods. It can be shown that the minimum cut may be determined by finding all nodes S which are reachable from s by some path of unfilled capacity.
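To make the above discussion concrete, the following is a minimal sketch of the augmenting-path scheme, using breadth-first search to find shortest augmenting paths and reading off the minimum s-t cut from the residual reachability of s at the end. The graph representation and function names are our own illustration, not the tuned dynamic-data-structure variant discussed in [4].

```python
from collections import deque

def max_flow_min_cut(n, edges, s, t):
    """Shortest augmenting-path (BFS) max-flow; edges = [(i, j, capacity)].
    Returns (max_flow_value, list_of_min_cut_edges)."""
    cap = {}                       # residual capacities, incl. reverse edges
    adj = [[] for _ in range(n)]
    for i, j, u in edges:
        cap[(i, j)] = cap.get((i, j), 0) + u
        cap.setdefault((j, i), 0)
        adj[i].append(j)
        adj[j].append(i)

    def bfs_path():
        # Find a shortest s-t path with unused (residual) capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            i = queue.popleft()
            if i == t:
                path = []
                while parent[i] is not None:
                    path.append((parent[i], i))
                    i = parent[i]
                return path[::-1]
            for j in adj[i]:
                if j not in parent and cap[(i, j)] > 0:
                    parent[j] = i
                    queue.append(j)
        return None

    flow = 0
    path = bfs_path()
    while path is not None:
        # Augment by the bottleneck residual capacity along the path.
        delta = min(cap[e] for e in path)
        for (i, j) in path:
            cap[(i, j)] -= delta
            cap[(j, i)] += delta
        flow += delta
        path = bfs_path()

    # S = all nodes still reachable from s by paths of unfilled capacity.
    S, stack = {s}, [s]
    while stack:
        i = stack.pop()
        for j in adj[i]:
            if j not in S and cap[(i, j)] > 0:
                S.add(j)
                stack.append(j)
    cut = [(i, j, u) for i, j, u in edges if i in S and j not in S]
    return flow, cut
```

For example, max_flow_min_cut(4, [(0, 1, 2), (0, 2, 1), (1, 3, 1), (2, 3, 2)], 0, 3) returns a flow of 2 together with the saturated cut edges (0, 2) and (1, 3), whose weights also sum to 2.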

We note that S will not contain the sink node t at maximum flow, since the sink is not reachable from the source with the use of a path of unfilled capacity. The set C(S, N − S) is the minimum s-t cut. Every edge in this set is saturated, and the total flow across the cut is essentially equal to the s-t maximum flow. We can then determine the global minimum cut by fixing the source s and varying the sink node t. The minimum cut over all these different possibilities will provide us with the global minimum-cut value.

A particularly important variant of this method is the shortest augmenting-path approach. In this approach, we always augment the maximum amount of flow from the source to the sink along the corresponding shortest path. It can be shown that for a network containing n nodes and m edges, the shortest-path length is guaranteed to increase by at least one after O(m) augmentations. Since the shortest path cannot be longer than n, it follows that the maximum number of augmentations is O(n·m). It is possible to implement each augmentation in O(log(n)) time with the use of dynamic data structures, which implies that the overall technique requires at most O(n·m·log(n)) time. A second class of algorithms which is often used to solve the maximum flow problem consists of preflow-push algorithms, which do not maintain the flow conservation constraints in their intermediate solutions. Rather, an excess flow is maintained at each node, and we try to push as much of this flow as possible along any edge on the shortest path from the source to the sink. A detailed discussion of preflow-push methods is beyond the scope of this chapter, and may be found in [4]. Most maximum flow methods require at least Ω(n·m) time, where n is the number of nodes and m is the number of edges.

A closely related problem to the minimum s-t cut problem is that of determining a global minimum cut in an undirected graph. This particular case can be solved more efficiently than the s-t minimum cut. One way of determining a minimum cut is by using a contraction-based edge-sampling approach. While the previous technique is applicable to both the directed and undirected versions of the problem, the contraction-based approach is applicable only to the undirected version. Furthermore, the contraction-based approach applies only to the case in which the weight of each edge is u_ij = 1. While the method can easily be extended to the weighted version by varying the edge-sampling probability, the polynomial running-time bounds discussed in [37] do not apply to that case. The contraction approach is a probabilistic technique in which we successively sample edges in order to collapse nodes into larger sets of nodes. By successively sampling different sequences of edges and picking the optimum value [37], it is possible to determine a global minimum cut. The broad idea of the contraction-based approach is as follows. We pick an edge randomly in the graph, and contract its two end points into a single node. We remove all self-loops which are created as a result of the contraction.

We may also create some parallel edges, which are allowed to remain, since they influence the sampling probability of subsequent contractions. (Alternatively, we may replace parallel edges by a single edge whose weight is equal to the number of parallel edges, and use this weight in order to bias the sampling process.) The process of contraction is repeated until we are left with two nodes. We note that each of these two super-nodes corresponds to a set of nodes in the original data, and these two sets of nodes provide us with the final minimum cut. The minimum cut will survive in this approach if none of the edges in the minimum cut are sampled during the contraction. An immediate observation is that cuts with a smaller number of edges are more likely to survive using this approach, because the edges in cuts which contain a large number of edges are much more likely to be sampled. One of the key observations in [37] is the following:

Lemma 9.1. When a graph containing n nodes is contracted down to t nodes, the probability that the minimum cut survives the contraction is at least t·(t−1)/(n·(n−1)) = Ω(t^2/n^2).

Proof: Let the minimum cut have k edges. Then, each vertex must have degree at least k, and therefore the graph must contain at least n·k/2 edges. The probability that the minimum cut survives the first contraction is therefore given by 1 − k/(#edges) ≥ 1 − 2/n. This relationship is derived by substituting the lower bound of n·k/2 for the number of edges. Similarly, in the second round of contractions, the probability of survival is at least 1 − 2/(n−1). Therefore, the overall probability p_s of survival is given by:

p_s = \prod_{i=0}^{n-t-1} \left( 1 - \frac{2}{n-i} \right) = \frac{t \cdot (t-1)}{n \cdot (n-1)}        (9.2)

This provides the result.

Thus, if we contract down to two nodes, the probability of the survival of the minimum cut is 2/(n·(n−1)). By repeating the process n·(n−1)/2 times, we can show that the probability that the minimum cut survives in at least one repetition is at least 1 − 1/e. If we further scale up the number of repetitions by a constant factor C > 1, the survival probability becomes at least 1 − (1/e)^C. By picking C = log(1/δ), we can ensure that the cut survives with probability at least 1 − δ, where δ ≪ 1. The logarithmic relationship ensures that we can determine minimum cuts with very high probability at a small additional cost. An additional implication of Lemma 9.1 is that the total number of distinct minimum cuts is bounded above by n·(n−1)/2. This is because the probability of the survival of any particular minimum cut is at least 2/(n·(n−1)), and the probability of the survival of some minimum cut cannot be greater than 1.
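A minimal sketch of a single contraction run is given below; it is our own illustration of the scheme in [37] for the unweighted case u_ij = 1, assuming a connected input graph. It repeatedly samples a remaining edge uniformly at random, merges its endpoints, discards the resulting self-loops, and keeps parallel edges so that the sampling stays biased toward dense regions.

```python
import random

def contract_once(nodes, edges):
    """One randomized contraction run on a connected graph; edges is a
    multiset (list) of node pairs. Returns (cut_size, (side_a, side_b))."""
    # Each super-node maps to the set of original nodes it contains.
    groups = {v: {v} for v in nodes}
    edges = list(edges)
    while len(groups) > 2:
        i, j = random.choice(edges)            # sample an edge uniformly
        groups[i] |= groups.pop(j)             # contract j into i
        # Relabel edges incident to j; drop the new self-loops, but keep
        # parallel edges so that future samples remain biased correctly.
        edges = [(i if a == j else a, i if b == j else b)
                 for a, b in edges if {a, b} != {i, j}]
    side_a, side_b = groups.values()
    return len(edges), (side_a, side_b)        # remaining edges cross the cut

def min_cut(nodes, edges, runs):
    """Repeat the contraction and keep the smallest cut found."""
    return min((contract_once(nodes, edges) for _ in range(runs)),
               key=lambda result: result[0])
```

Following the analysis above, choosing runs on the order of n·(n−1)/2 · log(1/δ) drives the failure probability below δ.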

Another observation is that the probability of survival of the minimum cut is largest in the first iteration, and it is reduced in successive iterations. For example, in the first iteration, the probability of survival is 1 − (2/n), but the probability of survival in the last iteration is only 1/3. Thus, most of the errors are caused in the last few iterations. This is particularly reflected in the cumulative error across many iterations, since the probability of maintaining the correct cut while contracting down to t nodes is Ω(t^2/n^2), whereas the probability of maintaining the correct cut in the remaining contractions (down to two nodes) is only Θ(1/t^2). Therefore, a natural solution is to use a two-phase approach. In the first phase, we do not contract down to 2 nodes, but only down to t nodes. The probability of maintaining the correct cut by the use of this approach is at least Ω(t^2/n^2). Therefore, O(n^2/t^2) contractions are required in order to reduce the graph to t nodes. Since each contraction requires O(n) time, the running time of the first phase is given by O(n^3/t^2). In the second phase, we use a standard maximum-flow based method in order to determine the minimum cut. This maximum flow problem needs to be repeated t times for a fixed source and different sinks. However, the base graph on which this is performed is much smaller, and contains only O(t) nodes. Each maximum flow problem requires O(t^3) time by using the method discussed in [8], and therefore the total time for all t problems is given by O(t^4). Therefore, the total running time is given by O(n^3/t^2 + t^4). By picking t = \sqrt{n}, we can obtain a running time of O(n^2). Thus, by using a two-phase approach, it is possible to obtain a much better running time than by using a single-phase contraction approach. The key idea behind this improvement is that, since most of the error probability is concentrated in the last contractions, it is better to stop the contraction process when the underlying graph is small enough, and then use conventional algorithms in order to determine the minimum cut. This combination approach is theoretically more efficient than any other known algorithm.

2.2 Multi-way Graph Partitioning

The multi-way graph partitioning problem is significantly more difficult, and is NP-hard [21]. In this case, we wish to partition a graph into k > 2 components, so that the total weight of the edges whose ends lie in different partitions is minimized. A well-known technique for graph partitioning is the Kernighan-Lin algorithm [26]. This classical algorithm is based on a hill-climbing (or, more generally, neighborhood-search) technique for determining the optimal graph partitioning. Initially, we start off with a random cut of the graph. In each iteration, we exchange a pair of vertices in two partitions to see if the overall cut value is reduced. In the event that the cut value is reduced, the interchange is performed. Otherwise, we pick another pair of vertices for the interchange. This process is repeated until we converge to an optimal solution.
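A minimal sketch of this hill-climbing scheme is given below, using single-vertex-pair interchanges from a random balanced start; a fixed iteration budget stands in for a convergence test, and the function names are our own illustration rather than the full Kernighan-Lin procedure of [26].

```python
import random

def cut_weight(weights, side):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for (i, j), w in weights.items() if side[i] != side[j])

def interchange_partition(nodes, weights, iterations=10000, seed=0):
    """weights: dict mapping an undirected edge (i, j) -> weight.
    Returns a two-way partition found by pairwise interchanges."""
    rng = random.Random(seed)
    nodes = list(nodes)
    rng.shuffle(nodes)                          # random balanced start cut
    side = {v: (k < len(nodes) // 2) for k, v in enumerate(nodes)}
    best = cut_weight(weights, side)
    for _ in range(iterations):
        a, b = rng.sample(nodes, 2)
        if side[a] == side[b]:
            continue                            # must swap across the cut
        side[a], side[b] = side[b], side[a]     # tentative interchange
        candidate = cut_weight(weights, side)
        if candidate < best:
            best = candidate                    # keep improving swaps
        else:
            side[a], side[b] = side[b], side[a]  # undo the interchange
    return side, best
```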

We note that this optimum may not be a global optimum, but may only be a local optimum of the underlying data. The main variation among different versions of the Kernighan-Lin algorithm is the policy which is used for performing the interchanges on the vertices. Some examples of strategies which may be used to perform the interchange are as follows:

- We randomly pick a pair of vertices and perform the interchange if it improves the underlying solution quality.

- We test all possible vertex-pair interchanges (or a sample of possible interchanges), and pick the interchange which improves the solution by the greatest amount.

- A k-interchange is one in which a sequence of k interchanges is performed at one time. We can test any k-interchange and perform it if it improves the underlying solution quality.

- We can pick the optimal k-interchange from a sample of possibilities.

We note that the use of more sophisticated strategies allows a greater improvement in the objective function for each interchange, but also requires more time per interchange. For example, the determination of an optimal k-interchange requires much more time than a straightforward interchange. This is a natural tradeoff which may work out differently depending upon the nature of the application at hand. Furthermore, the choice of the policy also affects the likelihood of getting stuck at a local optimum. For example, k-interchange techniques are far less likely to result in a local optimum for larger values of k. In fact, by choosing the best interchange across all possible values of k, it is possible to ensure that a global optimum is always reached. On the other hand, it becomes increasingly difficult to implement the algorithm efficiently for increasing values of k, because the time-complexity of the interchange increases exponentially with k. A detailed survey of different methods for optimal graph partitioning may be found in [18].

2.3 Conventional Generalizations and Network Structure Indices

Two well-known (and related) techniques for clustering in the context of multi-dimensional data [24] are the k-medoid and k-means algorithms. In the k-medoid algorithm (for multi-dimensional data), we sample a small number of points from the original data as seeds, and assign every other data point to the closest of these seeds. The closeness may be defined based on a user-defined objective function. The objective function for the clustering is defined as the sum of the corresponding distances of the data points to their respective seeds.

In the next iteration, the algorithm interchanges one of the seeds with another randomly selected point from the data, and checks if the quality of the objective function improves upon performing the interchange. If this is indeed the case, the interchange is accepted. Otherwise, we do not accept the interchange and try another sample interchange. This process is repeated until the objective function does not improve over a pre-defined number of interchanges. A closely related method is the k-means method. The main difference from the k-medoid method is that we do not use representative points from the original data after the first iteration of picking the original seeds. In subsequent iterations, we use the centroid of each cluster as the seed set for the next iteration. This process is repeated until the cluster membership stabilizes.

A method has been proposed in [35] which uses characteristics of both the k-means and k-medoids algorithms. (In [35], the method is proposed as a generalization of the k-medoid algorithm; however, it actually uses characteristics of both the k-means and k-medoid algorithms, since it uses centrality notions in the determination of subsequent seeds.) As in the case of the conventional partitioning algorithms, it picks k graph nodes as seeds. The main differences from the conventional algorithms lie in the computation of distances (for assignment purposes), and in the determination of subsequent seeds. A natural distance function for graphs is the geodesic distance, i.e., the smallest number of hops between a pair of nodes. In order to determine the seed set for the next iteration, we compute the local closeness centrality [20] within each cluster, and use the corresponding most central node as the new seed. Thus, while this algorithm continues to use seeds from the original data set (as in the k-medoids algorithm), it uses intuitive ideas from the k-means algorithm in order to determine the identity of these seeds.

There are some subtle challenges in the use of the graphical versions of distance-based clustering algorithms. One challenge is that, since distances are integers, it is possible for data points to be equidistant from several seeds. While ties can be resolved by randomly selecting one of the best assignments, this may result in clusterings which do not converge. In order to handle this instability, a more relaxed threshold is imposed on the number of medoids which may change from iteration to iteration. Specifically, a clustering is considered stable when the change between iterations is below a certain threshold (say, 1 to 3%).

Another challenge is that the computation of geodesic distances can be very expensive. The computational complexity of the all-pairs shortest-paths algorithm can be O(n^3), where n is the number of nodes, and even pre-storage of all-pairs shortest paths requires O(n^2) space. This is computationally not feasible in most practical scenarios, especially when the underlying graphs are large; even the space requirement alone may not be practical for very large graphs.
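The following is a minimal sketch of such a graph-based k-medoids iteration under simplifying assumptions: geodesic distances are computed on demand by breadth-first search, and a random seed interchange is accepted only when it improves the objective. The helper names are hypothetical and do not reproduce the centrality-based seed update of [35].

```python
import random
from collections import deque

def geodesic(adj, source):
    """Hop distances from source via BFS; adj maps node -> set of neighbors."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def objective(adj, seeds):
    """Sum over all nodes of the hop distance to the closest seed."""
    dists = [geodesic(adj, s) for s in seeds]
    n = len(adj)  # unreachable nodes are penalized with distance n
    return sum(min(d.get(v, n) for d in dists) for v in adj)

def k_medoids_graph(adj, k, trials=200, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(list(adj), k)
    best = objective(adj, seeds)
    for _ in range(trials):
        # Interchange one seed with a randomly chosen non-seed node.
        candidate = list(seeds)
        candidate[rng.randrange(k)] = rng.choice(
            [v for v in adj if v not in seeds])
        value = objective(adj, candidate)
        if value < best:                        # accept improving interchanges
            seeds, best = candidate, value
    return seeds, best
```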

In order to handle such cases, the method in [36] uses the concept of network-structure indices, which can summarize the behavior of the network by using randomized divisions into zones. In this case, the graph is divided into multiple zones, and the set of zones forms a connected, mutually exclusive, and exhaustive partitioning of the graph. The partitioning of the graph into zones is accomplished with the use of a competitive flooding algorithm. In this algorithm, we start off with randomly selected seed nodes, each labeled with a zone identifier; we then repeatedly select an unlabeled neighbor of some currently labeled node, and assign it the label of that node. This process is repeated until all nodes have been labeled. We note that while this approach is extremely fast, it may sometimes result in zones which do not reflect locality well. In order to deal with this situation, we use multiple sets of randomly selected partitions, each of which is considered a dimension. Note that when we use multiple such random partitions, each node becomes distinguishable from other nodes by virtue of its memberships. The distance between a node i and a zone containing node j is denoted by ZoneDistance(i, zone(j)), and is defined as the shortest-path distance between node i and any node in zone(j). The distance between i and j along a particular zone partitioning (or dimension) is approximated as ZoneDistance(i, zone(j)) + ZoneDistance(j, zone(i)). This value is then averaged over all the sets of randomized partitions in order to provide better robustness. It has been shown in [36] that this approach approximates pairwise distances quite well. The key observation is that the values of ZoneDistance(i, zone(j)) can be pre-computed in n·q space, where q is the number of zones. For a small number of zones, this is quite efficient. Upon using r different sets of partitions, the overall space requirement is n·q·r, which is much smaller than the Ω(n^2) space requirement of the all-pairs computation, for typical values of q and r as suggested in [35].
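A minimal sketch of this distance approximation under our own simplifying assumptions is given below: a queue-based flooding stands in for the randomized neighbor selection, one zoning is built per dimension, and zone-to-node distances are pre-computed by a multi-source BFS per zone. Names such as flood_zones and approx_distance are illustrative, not from [36].

```python
import random
from collections import deque

def flood_zones(adj, num_zones, rng):
    """Competitive flooding: grow zones outward from random seed nodes."""
    zone = {v: None for v in adj}
    frontier = deque()
    for z, s in enumerate(rng.sample(list(adj), num_zones)):
        zone[s] = z
        frontier.append(s)
    while frontier:
        i = frontier.popleft()
        for j in adj[i]:
            if zone[j] is None:
                zone[j] = zone[i]          # inherit the neighbor's label
                frontier.append(j)
    return zone

def zone_distances(adj, zone, z):
    """Multi-source BFS: hop distance from every node to zone z
    (assumes a connected graph)."""
    dist = {v: 0 for v in adj if zone[v] == z}
    queue = deque(dist)
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def approx_distance(i, j, dimensions):
    """Average ZoneDistance(i, zone(j)) + ZoneDistance(j, zone(i))
    over the r randomized zonings (dimensions)."""
    total = 0.0
    for zone, dist_to_zone in dimensions:
        total += dist_to_zone[zone[j]][i] + dist_to_zone[zone[i]][j]
    return total / len(dimensions)

# Building r dimensions with q zones each:
# rng = random.Random(0)
# dimensions = []
# for _ in range(r):
#     zone = flood_zones(adj, q, rng)
#     dimensions.append((zone, {z: zone_distances(adj, zone, z)
#                               for z in range(q)}))
```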

2.4 The Girvan-Newman Algorithm

The Girvan-Newman algorithm [23] is a divisive clustering algorithm which is based on the concept of edge betweenness centrality. Betweenness centrality attempts to identify edges which form critical bridges between different connected components, and to delete them until a natural set of clusters remains. Formally, the betweenness centrality of an edge is the proportion of shortest paths between node pairs which pass through that edge. Therefore, for a given edge e, we define the betweenness centrality B(e) as follows:

B(e) = \sum_{i,j} \frac{\text{NumConstrainedPaths}(e, i, j)}{\text{NumShortPaths}(i, j)}        (9.3)

Here, NumConstrainedPaths(e, i, j) refers to the number of (global) shortest paths between i and j which pass through e, and NumShortPaths(i, j) refers to the number of shortest paths between i and j. Note that the value of NumConstrainedPaths(e, i, j) may be 0 if none of the shortest paths between i and j contain e. The algorithm ranks the edges in order of their betweenness, and deletes the edge with the highest score. The betweenness coefficients are then recomputed, and the process is repeated. The set of connected components which remains after repeated deletion forms the natural clusters. A variety of termination criteria (e.g., fixing the number of connected components) can be used in conjunction with the algorithm.

A key issue is the efficient determination of edge-betweenness centrality. The number of shortest paths between any pair of nodes can be exponentially large, and it would seem that the computation of the betweenness measure would be a key bottleneck. It has been shown in [36] that the network structure index can also be used in order to estimate edge-betweenness centrality effectively by pairwise node sampling.
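A minimal sketch of the divisive loop is shown below, assuming an exact betweenness computation (networkx's edge_betweenness_centrality) rather than the sampling-based estimate of [36]; the stopping rule, a target number of connected components, is one of the termination criteria mentioned above.

```python
import networkx as nx

def girvan_newman(graph, target_components):
    """Repeatedly delete the edge of highest betweenness until the graph
    falls apart into the desired number of connected components."""
    g = graph.copy()
    while nx.number_connected_components(g) < target_components:
        betweenness = nx.edge_betweenness_centrality(g)
        # The edge with the highest betweenness forms a critical bridge.
        bridge = max(betweenness, key=betweenness.get)
        g.remove_edge(*bridge)
    return list(nx.connected_components(g))

# Example: two triangles joined by a single bridge edge.
# g = nx.Graph([(0,1), (1,2), (0,2), (3,4), (4,5), (3,5), (2,3)])
# girvan_newman(g, 2)  ->  [{0, 1, 2}, {3, 4, 5}]
```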

2.5 The Spectral Clustering Method

Eigenvector techniques are often used on multi-dimensional data in order to determine the underlying correlation structure in the data. It is natural to ask whether such techniques can also be used for the more general case of graph data. It turns out that this is indeed possible with the use of a method called spectral clustering. In the spectral clustering method, we make use of the node-node adjacency matrix of the graph. For a graph containing n nodes, let us assume that we have an n × n adjacency matrix, in which the entry (i, j) corresponds to the weight of the edge between the nodes i and j. This essentially corresponds to the similarity between nodes i and j. This entry is denoted by w_ij, and the corresponding matrix is denoted by W. This matrix is assumed to be symmetric, since we are working with undirected graphs; therefore, w_ij = w_ji for any pair (i, j). All diagonal entries of the matrix W are assumed to be 0. As discussed earlier, the aim of any node partitioning algorithm is to minimize (a function of) the weight of the edges across the partitions. The spectral clustering method constructs this minimization function in terms of the matrix structure of the adjacency matrix and another matrix which is referred to as the degree matrix. The degree matrix D is simply a diagonal matrix, in which all entries are zero except for the diagonal values; the diagonal entry d_ii is equal to the sum of the weights of the edges incident on node i. In other words, the entry d_ij is defined as follows:

d_{ij} = \begin{cases} \sum_{k=1}^{n} w_{ik} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}

We formally define the Laplacian matrix as follows:

Definition 9.2 (Laplacian Matrix). The Laplacian matrix L is defined by subtracting the weighted adjacency matrix from the degree matrix. In other words, we have:

L = D − W        (9.4)

This matrix encodes the structural behavior of the graph effectively, and its eigenvector behavior can be used in order to determine the important clusters in the underlying graph structure. It can be shown that the Laplacian matrix L is positive semi-definite, i.e., for any n-dimensional row vector f = [f_1 ... f_n] we have f L f^T ≥ 0. This can be easily shown by expressing L in terms of its constituent entries, which are a function of the corresponding weights w_ij. Upon expansion, it can be shown that:

f L f^T = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (f_i - f_j)^2        (9.5)

We summarize as follows.

Lemma 9.3. The Laplacian matrix L is positive semi-definite. Specifically, for any n-dimensional row vector f = [f_1 ... f_n], we have:

f L f^T = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (f_i - f_j)^2

At this point, let us examine some interpretations of the vector f in terms of the underlying graph partitioning. Consider the case in which each f_i is drawn from the set {0, 1}; this determines a two-way partition by labeling each node either 0 or 1. The particular partition to which node i belongs is defined by the corresponding label. Note that the expansion of the expression f L f^T in Lemma 9.3 simply represents the sum of the weights of the edges across the partition defined by f. Thus, the determination of an appropriate value of f for which the function f L f^T is minimized also provides us with a good node partitioning. Unfortunately, it is not easy to determine the discrete values of f which determine this optimum partitioning. Nevertheless, we will see later in this section that even when we relax f to real values, this provides us with the intuition necessary to create an effective partitioning.

An immediate observation is that the indicator vector f = [1 ... 1] is an eigenvector with a corresponding eigenvalue of 0. We note that f = [1 ... 1] must be an eigenvector, since L is positive semi-definite and f L f^T can be 0 only for eigenvectors with 0 eigenvalues.
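The identity in Lemma 9.3 is easy to check numerically. The following sketch builds L = D − W for a small weighted graph and verifies equation 9.5 for a random vector f; it is a sanity check of the algebra, not part of any published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric weighted adjacency matrix with a zero diagonal.
W = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])
D = np.diag(W.sum(axis=1))      # degree matrix
L = D - W                       # Laplacian (equation 9.4)

f = rng.standard_normal(3)
quadratic = f @ L @ f           # f L f^T
expansion = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                      for i in range(3) for j in range(3))
assert np.isclose(quadratic, expansion)    # equation 9.5
assert quadratic >= 0                      # positive semi-definiteness
```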

This observation can be generalized further in order to determine the number of connected components in the graph. We make the following observation.

Lemma 9.4. The number of (linearly independent) eigenvectors with zero eigenvalue for the Laplacian matrix L is equal to the number of connected components in the underlying graph.

Proof: Without loss of generality, we can order the vertices according to the connected component to which they belong. In this case, the Laplacian matrix takes on a block-diagonal form, which is illustrated below for the case of three connected components:

L = \begin{pmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{pmatrix}

Each of the blocks L_1, L_2 and L_3 is itself the Laplacian of the corresponding component. Therefore, the indicator vector for each component is an eigenvector with corresponding eigenvalue 0. The result follows.

We observe that connected components are the most obvious examples of clusters in the graph. Therefore, the determination of eigenvectors corresponding to zero eigenvalues provides us with information about this (relatively rudimentary) set of clusters. Broadly speaking, it may not be possible to glean such clean membership behavior from the other eigenvectors. One of the problems is that, other than this particular rudimentary set of eigenvectors (which correspond to the connected components), the vector components of the other eigenvectors are drawn from the real domain rather than the discrete {0, 1} domain. Nevertheless, because of the natural interpretation of f L f^T in terms of the weights of the edges across nodes with very different values of f_i, it is natural to cluster together nodes for which the values of f_i are as similar as possible, on the average, across the different eigenvectors. This provides us with the intuition necessary to define an effective spectral clustering algorithm, which partitions the data set into k clusters for any arbitrary value of k. The algorithm is as follows (a short code sketch follows the list):

- Determine the k eigenvectors with the smallest eigenvalues. Note that each eigenvector has as many components as the number of nodes. Let the component of the j-th eigenvector for the i-th node be denoted by p_ij.

- Create a new data set with as many records as the number of nodes. The i-th record in this data set corresponds to the i-th node and has k components. The record for this node is simply the set of eigenvector components for that node, which are denoted by p_i1 ... p_ik.

- Since we would like to cluster nodes with similar eigenvector components, we use any conventional clustering algorithm (e.g., k-means) in order to create k clusters from this data set.
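A minimal sketch of these three steps is shown below, assuming numpy for the eigen-decomposition and plain k-means from scikit-learn; the library choices are ours, since the survey deliberately leaves the final clustering algorithm open.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(W, k):
    """W: symmetric weighted adjacency matrix with a zero diagonal.
    Returns one cluster label per node."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized Laplacian
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # Columns of 'embedding' are the k eigenvectors of smallest eigenvalue;
    # row i is the new k-dimensional record (p_i1 ... p_ik) for node i.
    embedding = eigenvectors[:, np.argsort(eigenvalues)[:k]]
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```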

Note that the main focus of the approach is to transform the structural clustering problem into a more conventional multi-dimensional clustering problem, which is easy to solve. The particular choice of the multi-dimensional clustering algorithm is orthogonal to the broad spectral approach. The above algorithm provides a broad framework for spectral clustering. The input parameter for the above algorithm is the number of clusters k. In practice, a number of variations are possible in order to tune the quality of the clusters which are found. Some examples are as follows:

- It is not necessary to use the same number of eigenvectors as the input parameter for the number of clusters. In general, one should use at least as many eigenvectors as the number of clusters to be created. However, the exact number of eigenvectors required in order to obtain optimum results may vary with the particular data set, and can be determined only through experimentation.

- There are other ways of creating normalized Laplacian matrices which can provide more effective results in some situations. Some classic examples of such Laplacian matrices, in terms of the adjacency matrix W, degree matrix D, and identity matrix I, are defined as follows:

L_A = I - D^{-1/2} W D^{-1/2}
L_B = I - D^{-1} W

More details on the different methods which can be used for effective spectral graph clustering may be found in [9].

2.6 Determining Quasi-Cliques

A different way of determining dense subgraphs in the underlying data is that of determining quasi-cliques. This technique is different from many other partitioning algorithms, in that it focuses on definitions which maximize the edge density within a partition, rather than minimizing the edge density across partitions. A clique is a graph in which every pair of nodes is connected by an edge. A quasi-clique is a relaxation of this concept, and is defined by imposing a lower bound on the degree of each vertex within the given set of nodes. Specifically, we define a γ-quasi-clique as follows:

Definition 9.5. A k-graph (k ≥ 1) G is a γ-quasi-clique if the degree of each node in the corresponding sub-graph of vertices is at least γ·k.

The value of γ always lies in the range (0, 1]. We note that by choosing γ = 1, this definition reverts to that of standard cliques. Choosing lower values of γ allows for relaxations which are more realistic for real applications. This is because we rarely encounter complete cliques in real applications, and at least some edges within a dense subgraph will typically be missing. A vertex is said to be critical if its degree in the corresponding subgraph is the smallest integer which is at least equal to γ·k.

The earliest piece of work on this problem is from [1], which uses a greedy randomized adaptive search algorithm, GRASP, to find a quasi-clique of maximum size. A closely related problem is that of finding frequently occurring cliques in multiple data sets. In other words, when multiple graphs are obtained from different data sets, some dense subgraphs may occur frequently together in the different data sets. Such graphs help in determining important dense patterns of behavior in different data sources. Such techniques find applicability in mining important patterns in graphical representations of customers. The techniques are also helpful in mining cross-graph quasi-cliques in gene expression data. A description of the application of the technique to the problem of gene-expression data may be found in [33]. An efficient algorithm for determining cross-graph quasi-cliques was proposed in [32]. The main restriction of the work proposed in [32] is that the support threshold for the algorithms is assumed to be 100%. This restriction has been relaxed in subsequent work [43], which examines the problem of mining frequent closed quasi-cliques from a graph database with arbitrary support thresholds. In [31], a multi-graph version of the quasi-clique problem was explored. However, instead of finding the complete set of quasi-cliques in the graph, the authors proposed an approximation algorithm to cover all the vertices in the graph with a minimum number of p-quasi-complete subgraphs. Thus, this technique is more suited to summarization of the overall graph with a smaller number of densely connected subgraphs.
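A direct check of Definition 9.5 is straightforward; the sketch below tests whether a candidate node set induces a γ-quasi-clique. It illustrates the definition only, not the GRASP search of [1].

```python
def is_gamma_quasiclique(adj, nodes, gamma):
    """adj: node -> set of neighbors. True if every node of the induced
    subgraph on 'nodes' has degree at least gamma * k (Definition 9.5)."""
    k = len(nodes)
    members = set(nodes)
    return all(len(adj[v] & members) >= gamma * k for v in members)

# A triangle plus one pendant vertex:
# adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# is_gamma_quasiclique(adj, {0, 1, 2}, 0.6)     -> True  (degrees 2 >= 1.8)
# is_gamma_quasiclique(adj, {0, 1, 2, 3}, 0.6)  -> False (node 3: 1 < 2.4)
```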

2.7 The Case of Massive Graphs

A closely related problem is that of dense subgraph determination in massive graphs. This problem is frequently encountered in large graph data sets. For example, the problem of determining large dense subgraphs of web graphs was studied in [5, 22]. A min-hash approach was first used in [5] in order to determine syntactically related clusters; this work also introduced the advantages of using a min-hash approach in the context of graph clustering. Subsequently, the approach was generalized to the case of large dense graphs with the use of recursive application of the basic min-hash algorithm.

The broad idea in the min-hash approach is to represent the outlinks of a particular node as a set. Two nodes are considered similar if they share many outlinks. Thus, consider a node A with outlink set S_A and a node B with outlink set S_B. Then the similarity between the two nodes is defined by the Jaccard coefficient, which is given by |S_A ∩ S_B| / |S_A ∪ S_B|. We note that explicit enumeration of all the edges in order to compute this coefficient can be computationally inefficient. Rather, a min-hash approach is used in order to perform the estimation. This min-hash approach is as follows. We sort the universe of nodes in a random order. For any such random ordering, we determine the first node First(A) for which an outlink exists from A to First(A), and similarly the first node First(B) for which an outlink exists from B to First(B). It can be shown that the probability that First(A) and First(B) are the same node is exactly the Jaccard coefficient, so the frequency of such matches is an unbiased estimate of it. By repeating this process over different permutations of the universe of nodes, it is possible to estimate the Jaccard coefficient accurately. This is done by using a constant number c of permutations of the node order. The actual permutations are implemented by associating c different randomized hash values with each node. This creates c sets of hash values of size n. The sort order for any particular set of hash values defines the corresponding permutation order. For each such permutation, we store the minimum node index of the outlink set. Thus, for each node, there are c such minimum indices. This means that, for each node, a fingerprint of size c can be constructed, and by comparing the fingerprints of two nodes, the Jaccard coefficient can be estimated.

This approach can be further generalized with the use of every s-element subset contained entirely within S_A and S_B. Thus, the above description is the special case in which s is set to 1. By using different values of s and c, it is possible to design an algorithm which distinguishes between two sets that are above or below a certain threshold of similarity.

The overall technique in [22] first generates a set of c shingles of size s for each node. The process of generating the c shingles is extremely straightforward. Each node is processed independently. We use the min-wise hash function approach in order to generate subsets of size s from the outlinks at each node. This results in c subsets for each node; thus, for each node, we have a set of c shingles. If the graph contains a total of n nodes, the total size of this shingle fingerprint is n·c·sp, where sp is the space required for each shingle. Typically, sp will be O(s), since each shingle contains s nodes. For each distinct shingle thus created, we can create a list of the nodes which contain it. In general, we would like to determine groups of shingles which contain a large number of common nodes. In order to do so, the method in [22] performs a second-order shingling, in which meta-shingles are created from the shingles. This further compresses the graph into a data structure of size c × c, which is essentially of constant size. We note that this group of meta-shingles has the property that its members contain a large number of common nodes.
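A minimal sketch of the s = 1 case is shown below: each of c random hash functions stands in for one permutation of the node universe, and the per-node fingerprint keeps the minimum hash value over the outlink set. This is our own illustration of the scheme in [5, 22], not the full shingling pipeline.

```python
import random

def minhash_fingerprints(outlinks, c, seed=0):
    """outlinks: node -> set of target nodes. Returns a fingerprint of
    c min-hash values per node (one per simulated permutation)."""
    rng = random.Random(seed)
    # c randomized hash values per target node simulate c permutations.
    universe = set().union(*outlinks.values())
    hashes = [{v: rng.random() for v in universe} for _ in range(c)]
    return {node: [min(h[v] for v in targets) for h in hashes]
            for node, targets in outlinks.items() if targets}

def estimated_jaccard(fp_a, fp_b):
    """Fraction of permutations on which the minima collide."""
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / len(fp_a)
```

For outlink sets {1, 2, 3} and {2, 3, 4}, the true Jaccard coefficient is 2/4 = 0.5, and the estimate converges to this value as c grows.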

The dense subgraphs can then be extracted from these meta-shingles. More details on this approach may be found in [22].

The min-hash approach is frequently used for graphs which are extremely large and cannot be easily processed by conventional quasi-clique mining algorithms. Since the min-hash approach summarizes the massive graph in a small amount of space, it is particularly useful in leveraging this small-space representation for a variety of query-processing techniques. Examples of such applications include the web graph and social networks. In the case of web graphs, we wish to determine closely connected clusters of web pages with similar content. The related problem in social networks is that of finding closely related communities. The min-hash approach discussed in [5, 22] helps us achieve precisely this goal, because we can process the summarized min-hash structure in a variety of ways in order to extract the important communities from the summarized structure. More details of this approach may be found in [5, 22].

3. Clustering Graphs as Objects

In this section, we will discuss the problem of clustering entire graphs in a multi-graph database, rather than the node clustering problem within a single graph. Such situations are often encountered in the context of XML data, since each XML document can be regarded as a structural record, and it may be necessary to create clusters from a large number of such objects. We note that XML data is quite similar to graph data in terms of how the data is organized structurally: the attribute values can be treated as graph labels, and the corresponding semi-structural relationships as the edges. It has been shown in [2, 10, 28, 29] that this structural behavior can be leveraged in order to create effective clusters.

3.1 Extending Classical Algorithms to Structural Data

Since we are examining entire graphs in this version of the clustering problem, the problem simply boils down to that of clustering arbitrary objects, where the objects in this case have structural characteristics. Many of the conventional algorithms discussed in [24] (such as k-means type partitional algorithms and hierarchical algorithms) can be extended to the case of graph data. The main changes required in order to extend these algorithms are as follows:

- Most of the underlying classical algorithms typically use some form of distance function in order to measure similarity. Therefore, we need appropriate measures in order to define similarity (or distances) between structural objects.

- Many of the classical algorithms (such as k-means) use representative objects such as centroids in critical intermediate steps. While this is straightforward in the case of multi-dimensional objects, it is much more challenging in the case of graph objects. Therefore, appropriate methods need to be designed in order to create representative objects. Furthermore, in some cases it may be difficult to create representatives in terms of single objects. We will see that it is often more robust to use representative summaries of the underlying objects.

There are two main classes of conventional techniques which have been extended to the case of structural objects. These techniques are as follows:

- Structural Distance-based Approach: This approach computes structural distances between documents and uses them in order to compute clusters of documents. One of the earliest works on clustering tree-structured data is the XClust algorithm [28], which was designed to cluster XML schemas for efficient integration of large numbers of Document Type Definitions (DTDs) of XML sources. It adopts the agglomerative hierarchical clustering method, which starts with clusters of single DTDs and gradually merges the two most similar clusters into one larger cluster. The similarity between two DTDs is based on their element similarity, which can be computed according to the semantics, structure, and context information of the elements in the corresponding DTDs. One of the shortcomings of the XClust algorithm is that it does not make full use of the structure information of the DTDs, which is quite important in the context of clustering tree-like structures. The method in [7] computes similarity measures based on the structural edit-distance between documents. This edit-distance is used in order to compute the distances between clusters of documents. S-GRACE [29] is a hierarchical clustering algorithm in which an XML document is converted to a structure graph (or s-graph), and the distance between two XML documents is defined according to the number of common element-subelement relationships, which can capture structural similarity better than the tree edit distance in some cases [29]. (A small sketch of such an s-graph distance is given after this list.)

- Structural Summary-based Approach: In many cases, it is possible to create summaries from the underlying documents. These summaries are used for creating groups of documents which are similar to these summaries. The first summary-based approach for clustering XML documents was presented in [10], where the XML documents are modeled as rooted, ordered, labeled trees, and a framework for clustering XML documents using structural summaries of trees is presented. The aim is to improve algorithmic efficiency without compromising cluster quality.
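As a small illustration of the s-graph idea above, the sketch below represents each document by its set of element-subelement (parent-child label) pairs, and measures distance by the overlap of these sets. The document encoding and helper names are our own assumptions, and the normalization may differ from the exact measure used by S-GRACE [29].

```python
def s_graph(doc):
    """Set of (parent_label, child_label) edges of a nested-dict document,
    where each document maps a label to a list of child documents."""
    edges = set()
    for label, children in doc.items():
        for child in children:
            edges |= {(label, child_label) for child_label in child}
            edges |= s_graph(child)
    return edges

def s_graph_distance(doc_a, doc_b):
    """Distance based on common element-subelement relationships."""
    a, b = s_graph(doc_a), s_graph(doc_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / max(len(a), len(b))

# Example document: <book><title/><author><name/></author></book>
# doc = {"book": [{"title": []}, {"author": [{"name": []}]}]}
```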

A second approach for clustering XML documents is presented in [2]. This technique is a partition-based algorithm. The primary idea in this approach is to use frequent-pattern mining algorithms in order to determine the summaries of frequent structures in the data. The technique uses a k-means type approach in which each cluster center comprises a set of frequent patterns which are local to the partition for that cluster. The frequent patterns are mined using the documents assigned to a cluster center in the last iteration. The documents are then further re-assigned to a cluster center based on the average similarity between the document and the newly created cluster centers from the local frequent patterns. In each iteration, the document assignment and the mined frequent patterns are iteratively refined, until the cluster centers and document partitions converge to a final state. It has been shown in [2] that such a structural summary based approach is significantly superior to a similarity function based approach, as presented in [7]. The method is also superior to the structural approach in [10] because of its use of more robust representations of the underlying structural summaries. Since the most recent algorithm is the structural summary method discussed in [2], we will discuss it in more detail in the next section.

3.2 The XProj Approach

In this section, we will present XProj, which is a summary-based approach for the clustering of XML documents. The pseudo-code for clustering of XML documents is illustrated in Figure 9.1. The primary approach is to use a sub-structural modification of a partition-based approach, in which the clusters of documents are built around groups of representative sub-structures. Thus, instead of the single representative of a partition-based algorithm, we use a sub-structural set representative for the structural clustering algorithm. Initially, the document set D is randomly divided into k partitions of equal size, and the sets of sub-structure representatives are generated by mining frequent sub-structures of size l from these partitions. In each iteration, the sub-structural representatives (of a particular size, and at a particular support level) of a given partition are the frequent structures from that partition. These structural representatives are used to partition the document collection, and vice versa. We note that this can be a potentially expensive operation because of the determination of frequent substructures; in the next section, we will illustrate an interesting way to speed it up. In order to actually partition the document collection, we calculate the number of nodes in a document which are covered by each sub-structural set representative. A larger coverage corresponds to a greater level of similarity. The aim of this approach is that the algorithm will determine the most important localized sub-structures over time.
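The iterative structure described above can be summarized in a short sketch. Everything below the interface is an assumption: mine_frequent_substructures and coverage are hypothetical stand-ins for the frequent-pattern mining and node-coverage computations of XProj [2], which are not reproduced here, and a fixed number of rounds stands in for the convergence test.

```python
import random

def xproj_cluster(docs, k, rounds, mine_frequent_substructures, coverage):
    """Partition 'docs' around sets of representative sub-structures.
    mine_frequent_substructures(docs) -> set of frequent patterns;
    coverage(doc, patterns) -> number of nodes of doc covered."""
    rng = random.Random(0)
    docs = list(docs)
    rng.shuffle(docs)
    # Start from a random equal-size division of the document set.
    partitions = [docs[i::k] for i in range(k)]
    for _ in range(rounds):
        # Each cluster's representative is its set of local frequent patterns.
        representatives = [mine_frequent_substructures(p) for p in partitions]
        # Re-assign each document to the representative that covers it best.
        partitions = [[] for _ in range(k)]
        for doc in docs:
            best = max(range(k),
                       key=lambda c: coverage(doc, representatives[c]))
            partitions[best].append(doc)
    return partitions
```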


More information

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy

Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, Roma, Italy Graph Theory for Modelling a Survey Questionnaire Pierpaolo Massoli, ISTAT via Adolfo Ravà 150, 00142 Roma, Italy e-mail: pimassol@istat.it 1. Introduction Questions can be usually asked following specific

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

LECTURES 3 and 4: Flows and Matchings

LECTURES 3 and 4: Flows and Matchings LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that

More information

Clustering Algorithms for general similarity measures

Clustering Algorithms for general similarity measures Types of general clustering methods Clustering Algorithms for general similarity measures general similarity measure: specified by object X object similarity matrix 1 constructive algorithms agglomerative

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

Topic: Local Search: Max-Cut, Facility Location Date: 2/13/2007

Topic: Local Search: Max-Cut, Facility Location Date: 2/13/2007 CS880: Approximations Algorithms Scribe: Chi Man Liu Lecturer: Shuchi Chawla Topic: Local Search: Max-Cut, Facility Location Date: 2/3/2007 In previous lectures we saw how dynamic programming could be

More information

Modularity CMSC 858L

Modularity CMSC 858L Modularity CMSC 858L Module-detection for Function Prediction Biological networks generally modular (Hartwell+, 1999) We can try to find the modules within a network. Once we find modules, we can look

More information

On the Approximability of Modularity Clustering

On the Approximability of Modularity Clustering On the Approximability of Modularity Clustering Newman s Community Finding Approach for Social Nets Bhaskar DasGupta Department of Computer Science University of Illinois at Chicago Chicago, IL 60607,

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008 Lesson 4 Random graphs Sergio Barbarossa Graph models 1. Uncorrelated random graph (Erdős, Rényi) N nodes are connected through n edges which are chosen randomly from the possible configurations 2. Binomial

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

Lecture 11: Clustering and the Spectral Partitioning Algorithm A note on randomized algorithm, Unbiased estimates

Lecture 11: Clustering and the Spectral Partitioning Algorithm A note on randomized algorithm, Unbiased estimates CSE 51: Design and Analysis of Algorithms I Spring 016 Lecture 11: Clustering and the Spectral Partitioning Algorithm Lecturer: Shayan Oveis Gharan May nd Scribe: Yueqi Sheng Disclaimer: These notes have

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Algorithms for Grid Graphs in the MapReduce Model

Algorithms for Grid Graphs in the MapReduce Model University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem L. De Giovanni M. Di Summa The Traveling Salesman Problem (TSP) is an optimization problem on a directed

More information

Spectral Methods for Network Community Detection and Graph Partitioning

Spectral Methods for Network Community Detection and Graph Partitioning Spectral Methods for Network Community Detection and Graph Partitioning M. E. J. Newman Department of Physics, University of Michigan Presenters: Yunqi Guo Xueyin Yu Yuanqi Li 1 Outline: Community Detection

More information

Lecture 11: Maximum flow and minimum cut

Lecture 11: Maximum flow and minimum cut Optimisation Part IB - Easter 2018 Lecture 11: Maximum flow and minimum cut Lecturer: Quentin Berthet 4.4. The maximum flow problem. We consider in this lecture a particular kind of flow problem, with

More information

Edge Classification in Networks

Edge Classification in Networks Charu C. Aggarwal, Peixiang Zhao, and Gewen He Florida State University IBM T J Watson Research Center Edge Classification in Networks ICDE Conference, 2016 Introduction We consider in this paper the edge

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

A Framework for Clustering Massive-Domain Data Streams

A Framework for Clustering Massive-Domain Data Streams A Framework for Clustering Massive-Domain Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center 19 Skyline Drive, Hawthorne, NY, USA charu@us.ibm.com Abstract In this paper, we will examine the

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Network Clustering. Balabhaskar Balasundaram, Sergiy Butenko

Network Clustering. Balabhaskar Balasundaram, Sergiy Butenko Network Clustering Balabhaskar Balasundaram, Sergiy Butenko Department of Industrial & Systems Engineering Texas A&M University College Station, Texas 77843, USA. Introduction Clustering can be loosely

More information

by conservation of flow, hence the cancelation. Similarly, we have

by conservation of flow, hence the cancelation. Similarly, we have Chapter 13: Network Flows and Applications Network: directed graph with source S and target T. Non-negative edge weights represent capacities. Assume no edges into S or out of T. (If necessary, we can

More information

Non-exhaustive, Overlapping k-means

Non-exhaustive, Overlapping k-means Non-exhaustive, Overlapping k-means J. J. Whang, I. S. Dhilon, and D. F. Gleich Teresa Lebair University of Maryland, Baltimore County October 29th, 2015 Teresa Lebair UMBC 1/38 Outline Introduction NEO-K-Means

More information

On Biased Reservoir Sampling in the Presence of Stream Evolution

On Biased Reservoir Sampling in the Presence of Stream Evolution Charu C. Aggarwal T J Watson Research Center IBM Corporation Hawthorne, NY USA On Biased Reservoir Sampling in the Presence of Stream Evolution VLDB Conference, Seoul, South Korea, 2006 Synopsis Construction

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Algebraic Graph Theory- Adjacency Matrix and Spectrum

Algebraic Graph Theory- Adjacency Matrix and Spectrum Algebraic Graph Theory- Adjacency Matrix and Spectrum Michael Levet December 24, 2013 Introduction This tutorial will introduce the adjacency matrix, as well as spectral graph theory. For those familiar

More information

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the

More information

Parameterized graph separation problems

Parameterized graph separation problems Parameterized graph separation problems Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary, dmarx@cs.bme.hu Abstract.

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition. 18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have

More information

A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS

A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS Chapter 9 A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS Charu C. Aggarwal IBM T. J. Watson Research Center Hawthorne, NY 10532 charu@us.ibm.com Philip S. Yu IBM T. J. Watson Research Center Hawthorne,

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Centrality Book. cohesion.

Centrality Book. cohesion. Cohesion The graph-theoretic terms discussed in the previous chapter have very specific and concrete meanings which are highly shared across the field of graph theory and other fields like social network

More information

Structured System Theory

Structured System Theory Appendix C Structured System Theory Linear systems are often studied from an algebraic perspective, based on the rank of certain matrices. While such tests are easy to derive from the mathematical model,

More information

MCL. (and other clustering algorithms) 858L

MCL. (and other clustering algorithms) 858L MCL (and other clustering algorithms) 858L Comparing Clustering Algorithms Brohee and van Helden (2006) compared 4 graph clustering algorithms for the task of finding protein complexes: MCODE RNSC Restricted

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

6 Randomized rounding of semidefinite programs

6 Randomized rounding of semidefinite programs 6 Randomized rounding of semidefinite programs We now turn to a new tool which gives substantially improved performance guarantees for some problems We now show how nonlinear programming relaxations can

More information

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data

Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data Representation is Everything: Towards Efficient and Adaptable Similarity Measures for Biological Data Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Abstract Distance function computation

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Network optimization: An overview

Network optimization: An overview Network optimization: An overview Mathias Johanson Alkit Communications 1 Introduction Various kinds of network optimization problems appear in many fields of work, including telecommunication systems,

More information

Complexity Results on Graphs with Few Cliques

Complexity Results on Graphs with Few Cliques Discrete Mathematics and Theoretical Computer Science DMTCS vol. 9, 2007, 127 136 Complexity Results on Graphs with Few Cliques Bill Rosgen 1 and Lorna Stewart 2 1 Institute for Quantum Computing and School

More information

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:

More information

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters

Types of general clustering methods. Clustering Algorithms for general similarity measures. Similarity between clusters Types of general clustering methods Clustering Algorithms for general similarity measures agglomerative versus divisive algorithms agglomerative = bottom-up build up clusters from single objects divisive

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

CS 534: Computer Vision Segmentation and Perceptual Grouping

CS 534: Computer Vision Segmentation and Perceptual Grouping CS 534: Computer Vision Segmentation and Perceptual Grouping Ahmed Elgammal Dept of Computer Science CS 534 Segmentation - 1 Outlines Mid-level vision What is segmentation Perceptual Grouping Segmentation

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic

Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic SEMANTIC COMPUTING Lecture 6: Unsupervised Machine Learning Dagmar Gromann International Center For Computational Logic TU Dresden, 23 November 2018 Overview Unsupervised Machine Learning overview Association

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

CSCI5070 Advanced Topics in Social Computing

CSCI5070 Advanced Topics in Social Computing CSCI5070 Advanced Topics in Social Computing Irwin King The Chinese University of Hong Kong king@cse.cuhk.edu.hk!! 2012 All Rights Reserved. Outline Graphs Origins Definition Spectral Properties Type of

More information

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs Graphs and Network Flows IE411 Lecture 21 Dr. Ted Ralphs IE411 Lecture 21 1 Combinatorial Optimization and Network Flows In general, most combinatorial optimization and integer programming problems are

More information

Research Question Presentation on the Edge Clique Covers of a Complete Multipartite Graph. Nechama Florans. Mentor: Dr. Boram Park

Research Question Presentation on the Edge Clique Covers of a Complete Multipartite Graph. Nechama Florans. Mentor: Dr. Boram Park Research Question Presentation on the Edge Clique Covers of a Complete Multipartite Graph Nechama Florans Mentor: Dr. Boram Park G: V 5 Vertex Clique Covers and Edge Clique Covers: Suppose we have a graph

More information