DATA MINING II - 1DL460


1 DATA MINING II - 1DL460 Spring 2017 A second course in data mining Kjell Orsborn, Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden

2 Cluster validation (Tan, Steinbach, Kumar ch. 8.5) Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

3 Cluster validity For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is: how do we evaluate the goodness of the resulting clusters? But clusters are in the eye of the beholder! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.

4 Clusters found in random data (figures: a set of random points, and the clusters found in them by DBSCAN, K-means and complete link)

5 Different aspects of cluster validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information - use only the data.
4. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
5. Comparing the results of two different sets of cluster analyses to determine which is better.
Note: 1, 2 and 3 make no use of external information, while 4 requires external information. 5 may or may not use external information. For 3, 4 and 5, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters

6 Measures of cluster validity Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types. Internal index (unsupervised evaluation): used to measure the goodness of a clustering structure without respect to external information, e.g. the Sum of Squared Errors (SSE). External index (supervised evaluation): used to measure the extent to which cluster labels match externally supplied class labels, e.g. entropy. Relative index: used to compare two different clusterings or clusters; often an external or internal index is used for this function, e.g. SSE or entropy. Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" is the general strategy and "index" is the numerical measure that implements the criterion

7 Internal measures: measuring cluster validity via correlation Two matrices: P is the proximity matrix, where P(i,j) = d(x_i, x_j), i.e. the proximity/similarity between x_i and x_j. Q is the incidence matrix, with one row and one column for each data point: an entry Q(i,j) is 1 if the associated pair of points belongs to the same cluster, and 0 if the pair belongs to different clusters. Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated. The modified Hubert Γ statistic calculates this correlation between P and Q: Γ = (1/M) * sum_{i=1}^{n-1} sum_{j=i+1}^{n} P(i,j) Q(i,j), where M = n(n-1)/2 is the number of pairs of distinct points. High correlation indicates that points that belong to the same cluster are close to each other. Works best for globular clusters, but is not a good measure for some density- or contiguity-based clusters
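As a rough illustration (not from the course material), the correlation between the proximity matrix and the cluster incidence matrix can be computed directly with NumPy/SciPy; the function name below is my own.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def incidence_proximity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the cluster incidence matrix.

    X      : (n, d) array of data points
    labels : length-n array of cluster labels
    """
    labels = np.asarray(labels)
    n = len(labels)
    P = squareform(pdist(X))                                 # proximity matrix (Euclidean distances)
    Q = (labels[:, None] == labels[None, :]).astype(float)   # 1 if the pair is in the same cluster
    iu = np.triu_indices(n, k=1)                             # only the n(n-1)/2 entries above the diagonal
    return np.corrcoef(P[iu], Q[iu])[0, 1]
```

Note that with a distance-based proximity matrix, well-separated clusters give a strongly negative correlation; with a similarity matrix the correlation would instead be strongly positive.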

8 Internal measures: measuring cluster validity via correlation (figure: correlation of incidence and proximity matrices for the K-means clusterings of two data sets, one with well-separated clusters and one with random points)

9 Using similarity matrix for cluster validation Order the similarity matrix with respect to cluster labels and inspect visually. Sampling can be applied to large data sets. (figure: well-separated clusters and the corresponding reordered similarity matrix)

10 Using similarity matrix for cluster validation Clusters in random data are not so crisp (figure: random points clustered by DBSCAN and the corresponding reordered similarity matrix)

11 Using similarity matrix for cluster validation Clusters in random data are not so crisp (figure: random points clustered by K-means and the corresponding reordered similarity matrix)

12 Using similarity matrix for cluster validation Clusters in random data are not so crisp (figure: random points clustered by complete link and the corresponding reordered similarity matrix)

13 Using similarity matrix for cluster validation Clusters of non-globular shape are not as clearly separated in the similarity matrix (figure: DBSCAN clusters and the corresponding reordered similarity matrix)

14 Internal measures: SSE Internal index: used to measure the goodness of a clustering structure without respect to external information. SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters (figure: SSE plotted against K, showing an elbow at the natural number of clusters)

15 Internal measures: SSE Clusters in more complicated figures are sometimes not so well separated (figures: an SSE curve for a more complicated data set where elbows are not as clearly identified, and the SSE of clusters found using K-means)

16 Unsupervised cluster validity measure More generally, given K clusters: overall validity = sum_{i=1}^{K} w_i * validity(C_i), where validity(C_i) is a function of cohesion, separation, or both, and w_i is a weight associated with each cluster i. For SSE: w_i = 1 and validity(C_i) = sum_{x in C_i} (x - mu_i)^2

17 Internal measures: Cohesion and Separation Cluster cohesion: measures how closely related objects are in a cluster Cluster separation: measures how distinct or well-separated a cluster is from other clusters

18 Internal measures: Cohesion and Separation A proximity graph-based approach can be used for cohesion and separation. Cluster cohesion is the sum of the weights of all links within a cluster. Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster. (figure: cohesion and separation illustrated on a proximity graph)

19 Internal measures: Cohesion and Separation A prototype-based approach can also be used for cohesion and separation. Cluster cohesion is the sum of the proximities with respect to the prototype (centroid or medoid) of the cluster. Cluster separation is measured by the proximity of the cluster prototypes. (figure: cohesion and separation illustrated with cluster prototypes)

20 Graph-based versus Prototype-based views

21 Graph-based view Cluster cohesion: measures how closely related objects are in a cluster: Cohesion(C_i) = sum_{x in C_i, y in C_i} proximity(x, y). Cluster separation: measures how distinct or well-separated a cluster is from other clusters: Separation(C_i, C_j) = sum_{x in C_i, y in C_j} proximity(x, y)

22 Prototype-based view Cluster cohesion: Cohesion(C_i) = sum_{x in C_i} proximity(x, c_i), which is equivalent to SSE if proximity is the square of the Euclidean distance. Cluster separation: Separation(C_i, C_j) = proximity(c_i, c_j), or Separation(C_i) = proximity(c_i, c), where c_i and c_j are the cluster prototypes and c is the overall prototype
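A small sketch of the prototype-based view, assuming squared Euclidean distance as the proximity measure (the function name is my own):

```python
import numpy as np

def prototype_cohesion_separation(X, labels):
    """Prototype-based cohesion per cluster and separation between cluster pairs,
    using squared Euclidean distance as the proximity measure."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = {k: X[labels == k].mean(axis=0) for k in ids}
    # cohesion(C_i) = sum over x in C_i of ||x - c_i||^2  (equals the cluster's SSE)
    cohesion = {k: ((X[labels == k] - centroids[k]) ** 2).sum() for k in ids}
    # separation(C_i, C_j) = ||c_i - c_j||^2
    separation = {(i, j): ((centroids[i] - centroids[j]) ** 2).sum()
                  for i in ids for j in ids if i < j}
    return cohesion, separation
```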

23 Unsupervised cluster validity measures

24 Prototype-based vs Graph-based cohesion It can be shown that the prototype-based and graph-based approaches are equivalent for some proximity measures: for SSE and points in Euclidean space, the SSE of a cluster is proportional to the sum of pairwise squared distances between its points (SSE_i = (1/(2 m_i)) * sum over x, y in C_i of dist(x, y)^2)

25 Total Sum of Squares (TSS) When using Euclidean distance as the proximity measure, the traditional measure of separation between clusters is the between-group sum of squares (SSB). Notation: c is the overall mean, c_i is the centroid of cluster C_i, and m_i is the number of points in cluster C_i. (figure: clusters with centroids c_1, c_2, c_3 and overall mean c)
TSS = sum_{x} dist(x, c)^2
SSE = sum_{i=1}^{K} sum_{x in C_i} dist(x, c_i)^2
SSB = sum_{i=1}^{K} m_i * dist(c_i, c)^2

26 Total Sum of Squares (TSS) SSE: the total sum of squared errors, i.e. SSE summed over all clusters. SSB: the between-group sum of squares summed over all clusters. It can be shown that the total sum of squares (TSS) is constant, so given a data set, TSS is fixed: TSS = SSE + SSB. A clustering with large SSE has small SSB, while one with small SSE has large SSB. The goal is to minimize SSE and maximize SSB
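The identity TSS = SSE + SSB can be checked numerically with a few lines of NumPy; this sketch and its function name are illustrative only.

```python
import numpy as np

def tss_sse_ssb(X, labels):
    """Compute TSS, SSE and SSB for a partition of the data (TSS = SSE + SSB)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    c = X.mean(axis=0)                              # overall mean
    tss = ((X - c) ** 2).sum()
    sse = ssb = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        ck = pts.mean(axis=0)                       # cluster centroid
        sse += ((pts - ck) ** 2).sum()
        ssb += len(pts) * ((ck - c) ** 2).sum()
    return tss, sse, ssb

# A small 1-D example: points 1, 2, 4, 5 with clusters {1, 2} and {4, 5}
print(tss_sse_ssb(np.array([[1.], [2.], [4.], [5.]]), np.array([0, 0, 1, 1])))
# -> (10.0, 1.0, 9.0), i.e. TSS = SSE + SSB
```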

27 Total Sum of Squares (TSS) Example: four points 1, 2, 4 and 5 on a line, with overall mean c = 3.
K=1 cluster (centroid 3): SSE = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10; SSB = 4*(3-3)^2 = 0; TSS = SSE + SSB = 10.
K=2 clusters ({1, 2} with centroid 1.5 and {4, 5} with centroid 4.5): SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1; SSB = 2*(1.5-3)^2 + 2*(4.5-3)^2 = 9; TSS = SSE + SSB = 10.
TSS is the same (10) in both cases.

28 Internal Measures: Silhouette coefficient The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings. For an individual point i: calculate a_i = the average distance of i to the points in its own cluster; calculate b_i = the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette coefficient of the point is then s_i = (b_i - a_i) / max(a_i, b_i), where s_i is in [-1, 1] (simplified expression: s_i = 1 - a_i/b_i if a_i < b_i, or s_i = b_i/a_i - 1 if a_i >= b_i, which is not the usual case). Typically between 0 and 1; the closer to 1 the better. The average silhouette width can be calculated for a cluster or for an entire clustering
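A direct (unoptimized) sketch of the per-point silhouette computation, assuming Euclidean distance; scikit-learn's silhouette_score provides an equivalent, optimized version.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def silhouette_values(X, labels):
    """Per-point silhouette coefficients s_i = (b_i - a_i) / max(a_i, b_i)."""
    D = squareform(pdist(X))
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = (labels == labels[i])
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0                     # cohesion term
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])            # separation term
        s[i] = (b - a) / max(a, b)
    return s  # averaging gives the silhouette width of a cluster or a clustering
```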

29 Example silhouette coefficient

30 Correct number of clusters? Looking visually for knees, peaks or dips in a plot of an evaluation measure against the number of clusters can indicate the natural number of clusters. Does not always work well, e.g. when clusters are overlapping, intertwined or not clearly separated. (figures: a distinct knee in the SSE curve and a distinct peak in the silhouette coefficient at the natural number of clusters)

31 Clustering tendency Methods for evaluating whether the data has clusters, without actually clustering. The most common approach in Euclidean space is to apply a statistical test for spatial randomness; it can be quite challenging to select the correct model and parameters. Example: the Hopkins statistic (see blackboard example)
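One common formulation of the Hopkins statistic is sketched below; conventions vary between textbooks, so treat this as an illustration rather than the exact definition used in the course.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=None, rng=None):
    """Hopkins statistic sketch: with this convention, values near 0.5 indicate
    random (structureless) data and values close to 1 indicate clustered data."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    m = m or max(1, len(X) // 10)                        # size of the sample
    tree = cKDTree(X)
    # u: distances from m uniformly random points in the bounding box to the nearest data point
    rand_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u = tree.query(rand_pts, k=1)[0]
    # w: distances from m sampled data points to their nearest other data point
    sample = X[rng.choice(len(X), size=m, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]
    return u.sum() / (u.sum() + w.sum())
```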

32 Unsupervised evaluation of hierarchical clustering CoPhenetic Correlation Coefficient (CPCC) (figure: an example distance matrix and the corresponding single-link dendrogram)

33 Unsupervised evaluation of hierarchical clustering CoPhenetic Correlation Coefficient (CPCC) (figure: the single-link dendrogram and the cophenetic distance matrix for single link) CPCC is the correlation between the original distance matrix and the cophenetic distance matrix. The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. CPCC is a standard measure of how well a hierarchical clustering fits the data. CPCC is different for different types of hierarchical clusterings
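CPCC can be computed with SciPy's hierarchical clustering utilities; the toy data below is just a placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(30, 2)            # placeholder data
d = pdist(X)                         # original condensed distance matrix
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)    # hierarchical clustering
    cpcc, coph_d = cophenet(Z, d)    # correlation with the cophenetic distances
    print(method, round(cpcc, 3))    # CPCC differs between linkage methods
```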

34 Unsupervised evaluation of hierarchical clustering CoPhenetic Correlation Coefficient (CPCC) (figures: single-link and complete-link dendrograms, comparing CPCC for different clustering techniques)

35 Supervised measures of cluster validity: classification-oriented measures of cluster validity Entropy: the degree to which each cluster consists of objects of a single class. Purity: another measure of the extent to which a cluster contains objects of a single class. Precision: the fraction of a cluster that consists of objects of a specific class. Recall: the extent to which a cluster contains all objects of a specified class. F-measure: a combination of precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class

36 External measures of cluster validity: Entropy and Purity
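A small sketch of entropy and purity computed from a cluster-by-class contingency matrix (the function name is my own):

```python
import numpy as np

def entropy_and_purity(contingency):
    """Overall entropy and purity from a contingency matrix,
    where contingency[i, j] = number of objects of class j in cluster i."""
    C = np.asarray(contingency, dtype=float)
    m = C.sum()
    sizes = C.sum(axis=1)                                   # objects per cluster
    p = C / sizes[:, None]                                  # class distribution within each cluster
    cluster_entropy = -(p * np.log2(np.where(p > 0, p, 1.0))).sum(axis=1)
    cluster_purity = p.max(axis=1)
    w = sizes / m                                           # weight clusters by their size
    return w @ cluster_entropy, w @ cluster_purity          # overall entropy, overall purity
```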

37 Supervised measures of cluster validity: similarity-oriented measures of cluster validity A similarity-oriented measure of cluster validity is based on the idea that any two objects in the same cluster should be in the same class. Hence, these measure the extent to which two objects of the same class are also in the same cluster, and vice versa. Can be expressed as comparing two matrices: the ideal cluster similarity matrix, which has a 1 in the ij-th entry if objects i and j are in the same cluster and 0 otherwise, and the ideal class similarity matrix, defined with respect to class labels, which has a 1 in the ij-th entry if objects i and j belong to the same class and 0 otherwise. The correlation between these two matrices, called the Γ (gamma) statistic, can be taken as a measure of cluster validity

38 Supervised measures of cluster validity: similarity-oriented measures of cluster validity Correlation between cluster and class matrices. Five data points: p1, p2, p3, p4 and p5. Two clusters: C1 = {p1, p2, p3} and C2 = {p4, p5}. Two classes: L1 = {p1, p2} and L2 = {p3, p4, p5}. The correlation between these matrices is 0.359 according to the Tan book (it is unclear what formula is used there); our own calculation gives 0.1667 (see notes on blackboard)

39 Supervised measures of cluster validity: similarity-oriented measures of cluster validity Cluster validity measures based on the contingency table:

                   Same cluster   Different cluster
  Same class       f11            f10
  Different class  f01            f00

f11: number of pairs of objects having the same class and the same cluster
f01: number of pairs of objects having a different class and the same cluster
f10: number of pairs of objects having the same class and a different cluster
f00: number of pairs of objects having a different class and a different cluster

40 Supervised measures of cluster validity: similarity-oriented measures of cluster validity For the example above, the contingency table is:

                   Same cluster   Different cluster
  Same class       f11 = 2        f10 = 2
  Different class  f01 = 2        f00 = 4

Two frequently used cluster validity measures based on these counts are the Rand statistic and the Jaccard coefficient (see example on blackboard)
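The pair counts, Rand statistic and Jaccard coefficient for the slide's example can be computed as in the sketch below, using Rand = (f11 + f00) / (f11 + f10 + f01 + f00) and Jaccard = f11 / (f11 + f10 + f01).

```python
from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """Count object pairs by same/different cluster and same/different class."""
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_class and same_cluster:       f11 += 1
        elif same_class and not same_cluster: f10 += 1
        elif not same_class and same_cluster: f01 += 1
        else:                                 f00 += 1
    return f11, f10, f01, f00

# Example from the slide: C1 = {p1,p2,p3}, C2 = {p4,p5}; L1 = {p1,p2}, L2 = {p3,p4,p5}
clusters = [1, 1, 1, 2, 2]
classes  = [1, 1, 2, 2, 2]
f11, f10, f01, f00 = pair_counts(clusters, classes)
rand    = (f11 + f00) / (f11 + f10 + f01 + f00)   # = 6/10 = 0.6
jaccard = f11 / (f11 + f10 + f01)                 # = 2/6 ~ 0.33
```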

41 Supervised measures of cluster validity: cluster validity for hierarchical clusterings Supervised evaluation of hierarchical clustering is more difficult, e.g. a preexisting hierarchical structure might be hard to find. The hierarchical F-measure evaluates a hierarchical clustering with respect to a flat set of class labels

42 Supervised measures of cluster validity: cluster validity for hierarchical clusterings The idea of the hierarchical F-measure is to evaluate whether the hierarchical clustering, for each class, contains at least one cluster that is relatively pure and contains most objects of that class. Compute (for each class) the F-measure of each cluster in the hierarchy. Retrieve, for each class, the maximum F-measure attained over all clusters. Calculate the total F-measure as the weighted average of these per-class maxima, weighted by class size: F = sum_j (m_j / m) * max_i F(i, j), where m_j is the number of objects in class j, m is the total number of objects, and F(i, j) is the F-measure of cluster i with respect to class j
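A sketch of the hierarchical F-measure under the weighted-maximum formulation above; the data structures (clusters and classes as sets of object ids) are my own choice.

```python
def f_measure(cluster, cls_members):
    """F-measure of one cluster with respect to one class (both given as sets of object ids)."""
    tp = len(cluster & cls_members)
    if tp == 0:
        return 0.0
    precision = tp / len(cluster)
    recall = tp / len(cls_members)
    return 2 * precision * recall / (precision + recall)

def hierarchical_f_measure(hierarchy_clusters, classes):
    """hierarchy_clusters: all clusters in the tree, each a set of object ids.
       classes: dict mapping class label -> set of object ids."""
    m = sum(len(members) for members in classes.values())
    total = 0.0
    for members in classes.values():
        best = max(f_measure(c, members) for c in hierarchy_clusters)  # best cluster for this class
        total += len(members) / m * best                               # weight by class size
    return total
```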

43 Significance of cluster validity measures? How to interpret the significance of a calculated evaluation measure? Min/max values give some guidance; however, min/max values might not be available, or scale may affect interpretation. Different applications might tolerate different values. A common solution is to interpret the value of the validity measure in statistical terms. See the following examples:

44 Framework for cluster validity We need a framework to interpret any measure. For example, if our evaluation measure has the value 10, is that good, fair, or poor? Statistics provide a framework for cluster validity: the more atypical a clustering result is, the more likely it represents valid structure in the data. We can compare the values of an index obtained from random data or random clusterings to those of the actual clustering result: if the value of the index is unlikely under randomness, then the cluster results are valid. For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant

45 Statistical Framework for SSE Example: 2-d data with 100 points (figure: three well-separated clusters). Suppose a clustering algorithm produces a given (low) SSE value. Does that mean that the clusters are statistically significant?

46 Example continued: Statistical framework for SSE Generate 500 sets of 100 random data points, distributed over the same range of x and y values as the original data. Perform clustering with k = 3 clusters for each data set. Plot the histogram of SSE values and compare with the SSE obtained for the 3 well-separated clusters. (figures: the original data and the histogram of SSE values over the 500 random data sets)
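A sketch of this reference-distribution idea using scikit-learn's KMeans; the uniform range and the number of trials are assumptions to adjust to your data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_points, n_trials, k = 100, 500, 3

# Reference distribution: SSE of k-means on random (structureless) data
random_sse = []
for _ in range(n_trials):
    X = rng.uniform(0.0, 1.0, size=(n_points, 2))            # assumed data range
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    random_sse.append(km.inertia_)                            # inertia_ is the SSE

# An observed SSE far below this distribution suggests real cluster structure
print(np.percentile(random_sse, [1, 50, 99]))
```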

47 Statistical framework for correlation Correlation of incidence and proximity matrices for the K-means clusterings of two data sets (figures: well-separated clusters and random points). The correlation for the well-separated data is statistically significant; the correlation for the random data is not

48 Final comment on cluster validity The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage. Algorithms for Clustering Data, Jain and Dubes

49 Alt. Clustering Techniques Cluster analysis: additional issues and algorithms (Tan, Steinbach, Kumar ch. 9) Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

50 Characteristics of data, clusters and clustering algorithms No algorithm is suitable for all types of data, clusters, and applications. The characteristics of data, clusters, and algorithms strongly impact clustering, and are important for understanding, describing and comparing clustering techniques

51 Characteristics of data
High dimensionality: problems with density and proximity measures. Dimensionality reduction techniques are one approach to address this problem; redefinition of proximity and density is another (sections 9.4.5, 9.4.7).
Size: many algorithms have time or space complexity of O(m^2), m being the number of objects (section 9.5).
Sparseness: what type of sparse data? Asymmetric attributes? Is the magnitude important, or just occurrences?
Noise and outliers: how do noise and outliers affect a specific algorithm? (dealt with in DBSCAN, Chameleon (section 9.4.4), SNN density (9.4.8), CURE (9.5.3))
Types of attributes and data set: different proximity and density measures are needed for different types of data (ch 2).
Scale: normalization can be required (ch 2.3.7).
Mathematical properties of the data space: e.g. is a mean or a density meaningful for your data?

52 Characteristics of clusters
Data distribution: some clustering algorithms assume a particular type of distribution of the data, e.g. mixture models (9.2.2).
Shape: can a specific algorithm handle the expected cluster shapes? (DBSCAN, single-link, Chameleon, CURE)
Differing sizes: can a specific algorithm handle varying cluster sizes? (section 9.6)
Differing densities: can a specific algorithm handle varying cluster densities? (SNN density)
Poorly separated clusters: can a specific algorithm handle overlapping clusters? (fuzzy clustering, 9.2.1)
Relationships among clusters: e.g. self-organizing maps (SOM) consider the relationships between clusters.
Subspace clusters: clusters can exist in many distinct subspaces (subspace clustering - DENCLUE)

53 Characteristics of clustering algorithms
Order dependence: the quality and number of clusters can be order dependent, e.g. SOM.
Nondeterminism: e.g. K-means produces different results for every execution.
Scalability: required for large data sets; linear or near-linear time and space complexity is preferred, and O(m^2) is less practical for large data sets.
Parameter selection: it can be challenging to set parameter values.
Transforming the clustering problem to another domain: e.g. graph partitioning.
Treating clustering as an optimization problem: exhaustive search is computationally infeasible; in practice one needs heuristics that produce good (but maybe not optimal) results

54 Types of clustering algorithms One way of dividing clustering algorithms: Prototype-based clustering algorithms Density-based clustering algorithms Graph-based clustering algorithms Scalable clustering algorithms

55 Prototype-based clustering algorithms K-means centroid-based Fuzzy clustering Objects are allowed to belong to more than one cluster Relies on fuzzy set theory that allows an object to belong to a set with a degree of membership between 0 and 1. Applied in e.g. Fuzzy c-means algorithm (fuzzy version of K-means) Mixture models clustering Clusters modeled as statistical distributions EM (expectation-maximization) algorithm Self-organizing maps (SOM) Clusters constrained to have fixed relationships to neighbours e.g. two-dimensional grid structure

56 Example of Fuzzy c-means clustering

57 Fuzzy c-means clustering pros and cons Does not restrict data points to belong to one cluster; instead it indicates the degree to which a point belongs to each cluster. Also shares some strengths and weaknesses of K-means (e.g. problems with non-globular shapes), although it is more computationally intensive
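A minimal fuzzy c-means sketch, assuming Euclidean distance and the standard membership update u_ij proportional to d_ij^(-2/(m-1)); the function name and parameter defaults are my own.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100):
    """Minimal fuzzy c-means: soft memberships U and centroids updated alternately."""
    rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    U = rng.dirichlet(np.ones(c), size=len(X))               # membership degrees, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]         # fuzzily weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)              # standard FCM membership update
    return centers, U
```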

58 Mixture models clustering Clusters modeled as statistical distributions Exemplified by EM (expectation-maximization) algorithm

59 EM (expectation-maximization) algorithm E-step: prob(distribution j | x_i, Θ) is the probability that a point x_i came from a particular (commonly Gaussian) distribution j, where 1 <= j <= k and Θ_j = (μ_j, σ_j), calculated using the current parameter estimates Θ:
prob(distribution j | x_i, Θ) = prob(x_i | Θ_j) / sum_{l=1}^{k} prob(x_i | Θ_l)
M-step: the parameters Θ are re-estimated, over all n data points, by:
μ_j = sum_{i=1}^{n} x_i prob(distribution j | x_i, Θ) / sum_{i=1}^{n} prob(distribution j | x_i, Θ)
σ_j^2 = sum_{i=1}^{n} prob(distribution j | x_i, Θ) (x_i - μ_j)^2 / sum_{i=1}^{n} prob(distribution j | x_i, Θ)

60 EM (expectation-maximization) algorithm prob(distribution j | x_i, Θ) is the probability that a point x_i came from a particular (commonly Gaussian) distribution j, calculated using the estimated parameters Θ. Compare K-means, where the centroids are updated in each step in a similar alternating fashion
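A minimal 1-D EM sketch for a mixture of k Gaussians with equal mixing weights, matching the E- and M-steps above; the initialization and iteration count are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def em_1d(x, k=2, n_iter=100):
    """Minimal 1-D EM for a mixture of k Gaussians with equal mixing weights."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=k, replace=False)        # initial means
    sigma = np.full(k, x.std())                      # initial standard deviations
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.array([norm.pdf(x, mu[j], sigma[j]) for j in range(k)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters from the weighted points
        w = resp.sum(axis=1)
        mu = (resp @ x) / w
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / w)
    return mu, sigma

x = np.concatenate([np.random.normal(-4, 1, 200), np.random.normal(3, 0.5, 200)])
print(em_1d(x))
```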

61 EM clustering example

62 Additional EM clustering examples

63 EM clustering pros and cons Pros: more general than K-means, since it can use various types of distributions; can find clusters of different sizes and elliptical shapes; gives a condensed description of the clusters through a small number of parameters. Cons: can be slow for models including a large number of components; does not work well for a low number of data points; problems with estimating the number of clusters or, more generally, with choosing the exact form of the model; noise and outliers might cause problems

64 Self organizing maps (SOM) The goal of SOM is to find a set of centroids (reference vectors in SOM terminology) and to assign each data point to the centroid that provides the best approximation of that data point. In self-organizing maps (SOM), the centroids have a predetermined topographic ordering relationship. During the training process, SOM uses each data point to update the closest centroid and the nearby centroids in the topographic ordering. For example, the centroids of a two-dimensional SOM can be viewed as a structure on a 2D surface that tries to fit the n-dimensional data as well as possible

65 Kohonen self organizing maps Similarity between centroid nodes and input nodes: input X = <x_1, ..., x_h>; weights of centroid node i: <w_1i, ..., w_hi>. Similarity is defined based on the dot product: sim(X, i) = sum_{j=1}^{h} x_j w_ji. The centroid node i most similar to the input wins. The weights of the winning node (as well as of surrounding nodes) are increased(*): Δw_kj = c (x_k - w_kj) if node j is in N_i, and 0 otherwise, where c indicates the learning rate and N_i is the union of centroid i and its neighborhood centroids. (*) an alternative formulation is that the centroid and its neighborhood move closer to the current input node
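A minimal SOM training sketch; note that it matches the winner by smallest Euclidean distance rather than the dot-product form on the slide, and the grid size, learning rate and neighbourhood radius are arbitrary choices.

```python
import numpy as np

def som_train(X, grid=(5, 5), n_iter=1000, lr=0.5, radius=1.0):
    """Minimal SOM sketch: centroids on a 2-D grid; the winner and its topographic
    neighbours move toward each presented input."""
    rng = np.random.default_rng(0)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])   # grid positions
    W = rng.uniform(X.min(0), X.max(0), size=(rows * cols, X.shape[1]))     # reference vectors
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        winner = np.argmin(((W - x) ** 2).sum(axis=1))        # best-matching unit
        grid_dist = np.abs(coords - coords[winner]).sum(axis=1)
        neighbours = grid_dist <= radius                      # winner + topographic neighbours
        W[neighbours] += lr * (1 - t / n_iter) * (x - W[neighbours])   # decaying learning rate
    return W.reshape(rows, cols, -1)
```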

66 Example of SOM application In self-organizing maps, clusters are constrained to have fixed relationships to neighbours

67 Example of SOM application In self-organizing maps, clusters are constrained to have fixed relationships to neighbours

68 Illustration of a self organizing map (SOM) Illustration of SOM network where black dots are data and green dots represent SOM nodes (source:

69 Self organizing maps pros and cons Pros: related clusters are close to each other, which facilitates the interpretation and visualization of clustering results. Cons: requires setting several parameters (neighborhood function, grid type, number of centroids); a SOM cluster might not correspond to a single natural cluster, since it can encompass several natural clusters, or a single natural cluster might span several centroids; lacks an objective function, which can make it problematic to compare different SOM clustering results; convergence can be slow and is not guaranteed, although it is normally achieved

70 Density-based clustering algorithms (1) DBSCAN Core points, border points and noise Grid-based clustering Objects are assigned to grid cells corresponding to attribute intervals Subspace clustering There can be different clusters in different subspaces

71 Density-based clustering algorithms (2) Grid-based clustering:

72 Grid-based clustering pros and cons Pros: grid-based clustering can be very efficient and effective, with O(m) complexity (m being the number of points) for defining the grid and an overall complexity of O(m log m). Cons: dependent on the choice of density threshold (too high and clusters will be lost; too low and separate clusters may be joined); differing densities might make it hard to find a single threshold that works for the complete data space; a rectangular grid may have problems accurately capturing circular boundaries; grid-based clustering tends to work poorly for high-dimensional data

73 Density-based clustering algorithms (3) Subspace clustering: Subspace clustering considers subsets of features Clusters found in subspaces can be quite different Data may be clustered with respect to a small set of attributes but random for others Different clusters might exist in different sets of dimensions

74 Density-based clustering algorithms (4) CLIQUE (Clustering In QUEst) Grid-based clustering that methodically finds subspace clusters. The number of subspaces is exponential in the number of dimensions, so an efficient pruning technique is needed. CLIQUE relies on the monotonicity property of density-based clusters: if a set of points forms a density-based cluster in k dimensions, then the same points are also part of a density-based cluster in all possible subsets of those dimensions. Based on the well-known Apriori principle from association analysis

75 Density-based clustering algorithms (5) CLIQUE pros and cons Well understood, since it is based on the well-known Apriori principle. Can summarize the list of cells in clusters with a small set of inequalities. Clusters can overlap, which can make interpretation difficult. Potential exponential time complexity: too many cells might be generated for lower values of k, i.e. for low dimensions

76 Density-based clustering algorithms (6) DENCLUE (density clustering) Models the overall density of a set of points as the sum of the influence (or kernel) functions associated with each point Based on kernel density estimation, a well-developed area of statistics. Peaks (local density attractors) forms the basis for forming clusters A minimum density threshold, ξ, separates clusters from data points considered to be noise Grid-based implementation defines neighborhood and reduces complexity

77 Density-based clustering algorithms (7) DENCLUE:

78 Density-based clustering algorithms (8) DENCLUE pros and cons Solid theoretical foundation. Good at handling noise and outliers and at finding clusters of different shapes and sizes. Can be computationally expensive. Accuracy is sensitive to the choice of grid size (in the grid-based implementation). Problems with high dimensionality and differing densities

79 Graph-based clustering algorithms (1) In graph-based clustering, data objects are represented as nodes and proximity as the edge weight between the two corresponding nodes. Agglomerative (hierarchical) clustering algorithms were discussed in the DM1 course. Here some more advanced graph-based clustering algorithms are presented that apply different subsets of the following key approaches: Sparsification of the proximity graph, keeping only connections of an object to its nearest neighbours - useful for handling noise and outliers, and makes it possible to apply efficient graph partitioning algorithms. Using information in the proximity graph for more sophisticated merging strategies, e.g. whether a merged cluster will have similar characteristics to the original unmerged clusters. Similarity measures between objects based on the number of shared nearest neighbours - useful for handling problems with high dimensionality and varying density. Defining core points as a basis for cluster generation - requires a density-based concept for (sparsified) proximity graphs, and is useful for handling differing shapes and sizes

80 Graph-based clustering algorithms (2) Graph-based clustering uses the proximity graph Start with the proximity matrix Consider each point as a node in a graph Each edge between two nodes has a weight which is the proximity between the two points Initially the proximity graph is fully connected MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph In the simplest case, clusters are connected components in the graph

81 Clustering using sparsification Sparsification - based on the observation that for most data sets objects are highly similar to a small no of objects and weakly similar to most other objects. Sparsification can be made by either removing links that have a similarity below a specified threshold or by keeping links to the k nearest neighbours

82 Clustering using sparsification The amount of data that needs to be processed is drastically reduced Sparsification can eliminate more than 99% of the entries in a proximity matrix The amount of time required to cluster the data is drastically reduced The size of the problems that can be handled is increased Clustering may work better Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points. The nearest neighbors of a point tend to belong to the same class as the point itself. This reduces the impact of noise and outliers and sharpens the distinction between clusters. Sparsification facilitates the use of graph partitioning algorithms E.g. applied in Opossum and Chameleon
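A sketch of k-nearest-neighbour sparsification of a similarity matrix (the function name is my own; note the result is not necessarily symmetric):

```python
import numpy as np

def sparsify_knn(S, k):
    """Keep only each point's k strongest links in an n x n similarity matrix S."""
    n = len(S)
    sparse = np.zeros_like(S)
    for i in range(n):
        order = np.argsort(S[i])[::-1]            # neighbours by decreasing similarity
        keep = [j for j in order if j != i][:k]   # the k most similar other points
        sparse[i, keep] = S[i, keep]
    return sparse
```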

83 Graph-based clustering algorithms (3) MST (minimum spanning tree) clustering and OPOSSUM (optimal partitioning of sparse similarities using Metis) Both MST and OPOSSUM rely solely on sparsification. MST is equivalent to single-link clustering and was treated in the previous course. MST (Minimum Spanning Tree): build the MST, then create clusters by partitioning the tree, removing edges in order of decreasing distance, until clusters consist of single nodes. OPOSSUM uses the well-known graph partitioning algorithm METIS: 1) compute a sparsified similarity graph, 2) partition the graph into k clusters using METIS. OPOSSUM pros and cons: simple and fast, and handles sparse high-dimensional data; produces roughly equal-sized clusters, which can be good or bad, since enforcing equal-sized clusters means natural clusters can be broken or combined

84 MST: Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree) Start with a tree that consists of any point In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not Add q to the tree and put an edge between p and q

85 MST and OPOSSUM MST - use a minimum spanning tree for constructing hierarchy of clusters OPOSSUM partition a sparsified graph using the METIS algorithm

86 Graph-based clustering algorithms (4) Chameleon Hierarchical clustering with dynamic modeling Combines an initial partitioning of data and a novel hierarchical clustering scheme Use sparsification and the concept of self similarity Jarvis-Patrick and ROCK Based on shared-nearest neighbour (SNN) similarity SNN density-based clustering algorithm Based on shared-nearest neighbour density

87 Limitations of Current Cluster Merging Schemes Some existing cluster merging schemes in hierarchical clustering algorithms focus on either closeness or connectivity. MIN (single-link) or CURE: merge two clusters based on their closeness (or minimum distance) GROUP-AVERAGE: merge two clusters based on their average connectivity Using only one of these approaches can lead to mistakes in forming clusters Many cluster techniques also have a global (static) model for clusters. K-means globular clusters DBSCAN clusters defined by single density threshold These schemes have problems handling clusters of widely varying sizes, shapes, and densities

88 Limitations of Current Cluster Merging Schemes (a) (b) (c) (d) Closeness schemes will merge (a) and (b) Average connectivity schemes will merge (c) and (d)

89 Chameleon: Clustering Using Dynamic Modeling Adapt to the characteristics of the data set to find the natural clusters Use a dynamic model to measure the similarity between clusters Main property is the relative closeness and relative inter-connectivity of the cluster Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters The merging scheme preserves self-similarity One of the areas of application is spatial data

90 Characteristics of spatial data sets Clusters are defined as densely populated regions of the space Clusters have arbitrary shapes, orientation, and nonuniform sizes Difference in densities across clusters and variation in density within clusters Existence of special artifacts (streaks) and noise The clustering algorithm must address the above characteristics and also require minimal supervision

91 Chameleon: steps Preprocessing Step: Represent the data by a graph Given a set of points, construct the k-nearest-neighbor (k-nn) graph to capture the relationship between a point and its k nearest neighbors Concept of neighborhood is captured dynamically (even if region is sparse) Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices Each cluster should contain mostly points from one true cluster, i.e., is a sub-cluster of a real cluster

92 Chameleon: steps Phase 2: Use hierarchical agglomerative clustering to merge sub-clusters. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters. Two key properties are used to model cluster similarity: Relative Interconnectivity: the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Relative Closeness: the absolute closeness of two clusters normalized by the internal closeness of the clusters. One approach to combining RI and RC when merging clusters in Chameleon is to maximize RI(C_i, C_j) * RC(C_i, C_j)^α, where α is a user-specified parameter, typically > 1

93 Chameleon pros and cons Pros: clusters spatial data effectively even when the data include noise and outliers; handles clusters of different shapes, sizes and densities. Cons: assumes that the groups of points produced by sparsification and partitioning belong to the same true cluster, and cannot separate such objects later in the agglomerative hierarchical clustering step; i.e. Chameleon can have problems in high-dimensional space, where partitioning does not produce clean sub-clusters

94 Experimental Results: CHAMELEON

95 Experimental Results: CHAMELEON

96 Experimental Results: CHAMELEON

97 Experimental Results: CURE (10 clusters)

98 Experimental Results: CURE (15 clusters)

99 Experimental Results: CHAMELEON

100 Experimental Results: CURE (9 clusters)

101 Experimental Results: CURE (15 clusters)

102 Shared Nearest Neighbour (SNN) Approach SNN graph: the weight of an edge is the number of shared neighbours between the two vertices, given that the vertices are connected (figure: two nodes i and j and their shared neighbours)

103 Creating the SNN Graph (figures: a sparse graph, where link weights are similarities between neighboring points, and the derived shared nearest neighbor graph, where link weights are the number of shared nearest neighbors)

104 Jarvis-Patrick Clustering First, the k-nearest neighbors of all points are found. In graph terms, this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph. A pair of points is put in the same cluster if the two points share more than T neighbors and the two points are in each other's k-nearest-neighbor lists. For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors
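A naive O(n^2) sketch of Jarvis-Patrick clustering following the description above; the parameter defaults are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def jarvis_patrick(X, k=20, t=10):
    """Link two points if they are in each other's k-NN lists and share more than
    t of their k nearest neighbours; clusters are connected components of the link graph."""
    D = squareform(pdist(X))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]                    # k nearest neighbours per point
    knn_sets = [set(row) for row in nn]
    n = len(X)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in knn_sets[i]:
            if i in knn_sets[j] and len(knn_sets[i] & knn_sets[j]) > t:
                A[i, j] = A[j, i] = True                      # SNN link
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels
```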

105 Jarvis-Patrick clustering pros and cons Pros: SNN similarity means JP is good at handling noise and outliers Can handle clusters of different sizes, shapes and densities Works well in high-dimensional space, especially finding tight clusters of strongly related objects Cons: Jarvis-Patrick clustering is somewhat brittle since split/join may depend on very few links May not cluster all objects - but that can be handled separately Complexity O(m 2 ) (or O(m log m) in low-dimensional space) Finding best parameter settings can be challenging

106 When Jarvis-Patrick Works Well (figures: the original points and the Jarvis-Patrick clustering, using a threshold of 6 shared neighbors out of the nearest-neighbor list)

107 When Jarvis-Patrick does NOT work well (figures: the smallest threshold T that does not merge the natural clusters, and a threshold just below T)

108 ROCK (RObust Clustering using links) ROCK uses a number-of-shared-neighbors proximity measure similar to Jarvis-Patrick. It is a clustering algorithm for data with categorical and Boolean attributes. A pair of points is defined to be neighbors if their similarity is greater than some threshold. A hierarchical clustering scheme is then used to cluster the data: 1. Obtain a sample of points from the data set. 2. Compute the link value for each pair of points, i.e., transform the original similarities (computed by the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points. 3. Perform agglomerative hierarchical clustering on the data, using the number of shared neighbors as the similarity measure and maximizing the shared-neighbors objective function. 4. Assign the remaining points to the clusters that have been found

109 SNN Density Clustering The Shared Nearest Neighbour Density algorithm is based upon SNN similarity in combination with a DBSCAN approach SNN density measures the degree to which a point is surrounded by similar points High/low density areas tend to have high SNN density Transition areas (high-low density) tend to have low SNN density

110 SNN Density (figures: a) all points; b) high SNN density (points having 34 or more shared nearest neighbors); c) medium SNN density; d) low SNN density (points having 17 or fewer shared nearest neighbors))

111 SNN Density clustering algorithm 1. Compute the similarity matrix This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points 2. Sparsify the similarity matrix by keeping only the k most similar neighbors This corresponds to only keeping the k strongest links of the similarity graph 3. Construct the shared nearest neighbor graph from the sparsified similarity matrix. At this point, one could have applied a similarity threshold and find the connected components to obtain the clusters (i.e. Jarvis-Patrick algorithm) 4. Find the SNN density of each Point. Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point

112 SNN Density clustering algorithm 5. Find the core points Using a user specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts 6. Form clusters from the core points If two core points are within a radius, Eps, of each other they are placed in the same cluster 7. Discard all noise points All non-core points that are not within a radius of Eps of a core point are discarded 8. Assign all non-noise, non-core points to clusters This can be done by assigning such points to the nearest core point (Note that steps 4-8 are DBSCAN)

113 SNN Density Clustering can handle differing densities Original Points SNN Density Clustering

114 SNN Density Clustering can handle other difficult situations

115 Finding Clusters of Time Series In Spatio-Temporal Data (figures: SLP clusters found via shared nearest neighbor clustering (100 NN), plotted by latitude and longitude, and the SNN density of SLP time series data, i.e. the SNN density of points on the globe)

116 Features and limitations of SNN Clustering Has similar strengths and limitations as Jarvis-Patrick clustering. Does not cluster all the points. Complexity of SNN clustering is high: O(n * time to find the neighbours within Eps), which in the worst case is O(n^2). For lower dimensions, there are more efficient ways to find the nearest neighbours, such as R* trees and k-d trees. The approach of using core points and SNN density adds power and flexibility

117 Scalable clustering algorithms Issues: storage requirements and computational requirements. Approaches:
Multi-dimensional or spatial access methods: e.g. k-d tree, R*-tree and X-tree form hierarchical partitions of the data space; grid-based clustering also partitions the data space.
Bounds on proximities: reduces the number of proximity calculations.
Sampling: cluster a sample of the points, e.g. CURE.
Partitioning of data objects: bisecting K-means, CURE.
Summarization: represent summarizations of the data, e.g. BIRCH

118 Scalable clustering algorithms Approaches continued:
Parallel and distributed computation: parallel and distributed versions of clustering algorithms; conventional parallel master-slave programming approaches; MapReduce / Hadoop based approaches.
Algorithms specifically designed for large-scale clustering: DBSCAN (including spatial indexing); BIRCH (balanced iterative reducing and clustering using hierarchies), which builds a compressed tree structure of clustering features (CF) in a CF-tree, including measures for inter/intra-cluster cohesion and separation; CURE (clustering using representatives, applying sampling and partitioning); CLARANS (clustering large applications based on K-medoids and randomized search).
Parallel and distributed versions of earlier clustering algorithms: PBIRCH (parallel version of BIRCH); ICLARANS (parallel version of CLARANS); PKMeans (parallel version of K-means) and K-Means++ (MapReduce versions of K-means); Parallel DBSCAN and MR-DBSCAN (parallel and MapReduce versions of DBSCAN)

119 Computational complexity of clustering algorithms (from the book Clustering, Xu and Wunsch, 2009) TABLE 8.1. Computational complexity of clustering algorithms. Algorithms like BIRCH and WaveCluster can scale linearly with the input size and handle very large data sets.

Cluster Algorithm          Complexity                                                Suitable for High-Dimensional Data
K-means                    O(NKd) (time), O(N+K) (space)                             No
Fuzzy c-means              Near O(N)                                                 No
Hierarchical clustering*   O(N^2) (time and space)                                   No
PAM                        O(K(N-K)^2)                                               No
CLARA+                     O(K(40+K)^2 + K(N-K)) (time)                              No
CLARANS                    Quadratic in total performance                            No
BIRCH                      O(N) (time)                                               No
DBSCAN                     O(N log N) (time)                                         No
CURE                       O(N_sample^2 log N_sample) (time), O(N_sample) (space)    Yes
WaveCluster                O(N) (time)                                               No
DENCLUE                    O(N log N) (time)                                         Yes
FC                         O(N) (time)                                               Yes
STING                      O(number of cells at the bottom layer)                    No
CLIQUE                     Linear in the number of objects, quadratic in dimensions  Yes
OptiGrid                   Between O(Nd) and O(Nd log N)                             Yes
ORCLUS                     O(K0^3 + K0 N d + K0 d^2) (time), O(K0 d^2) (space) #     Yes

* Includes single-linkage, complete-linkage, average-linkage, etc.
+ Based on the heuristic for drawing a sample from the entire data set (Kaufman and Rousseeuw, 1990)
# K0 is the number of initial seeds

120 Comparison of Clustering Techniques

121 BIRCH - balanced iterative reducing and clustering using hierarchies BIRCH is designed for clustering large amounts of numeric data by integrating hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage). It overcomes two difficulties of agglomerative clustering methods: scalability and the inability to undo what was done in the previous step. BIRCH uses the notion of a clustering feature to summarize a cluster, and a clustering feature tree (CF-tree) to represent a cluster hierarchy. It achieves good performance and scalability (O(n)) in large or even streaming databases, which also makes it effective for incremental and dynamic clustering of incoming objects. The ideas of BIRCH to use clustering features and CF-trees have been borrowed by many others to tackle problems of clustering streaming and dynamic data

122 BIRCH - balanced iterative reducing and clustering using hierarchies The clustering feature (CF) of a cluster is a 3-D vector summarizing information about the cluster of objects. It is defined as CF = (n, LS, SS), where n is the number of data points, LS is the linear sum of the points and SS is the square sum of the points. The clustering feature can be used to derive many useful statistics of a cluster; for example, the cluster's centroid c is given by c = LS / n. Clustering features are also additive: for two disjoint clusters C1 and C2 with clustering features CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2), the clustering feature for the cluster formed by merging C1 and C2 is simply CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)
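A small sketch of a BIRCH-style clustering feature, illustrating the centroid derivation and the additivity property (the class and method names are my own, not BIRCH's actual implementation):

```python
import numpy as np

class ClusteringFeature:
    """BIRCH-style clustering feature: (n, linear sum, square sum)."""
    def __init__(self, points):
        points = np.atleast_2d(points)
        self.n = len(points)
        self.ls = points.sum(axis=0)          # linear sum of the points
        self.ss = (points ** 2).sum()         # square sum of the points (scalar)

    def centroid(self):
        return self.ls / self.n               # c = LS / n

    def merge(self, other):
        """CFs are additive: merging two clusters just adds the CF components."""
        merged = ClusteringFeature(np.zeros((1, len(self.ls))))
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged
```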

123 BIRCH - balanced iterative reducing and clustering using hierarchies A CF-tree is a height-balanced tree storing clustering features for a hierarchical clustering. A nonleaf node in the tree has a number of children, and nonleaf nodes also store the sums of the CFs of their children, summarizing the clustering information about their children. A CF-tree has two parameters: the branching factor, B, and the threshold, T. The branching factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies the maximum diameter of the subclusters stored at the leaf nodes of the tree. These two parameters implicitly control the resulting tree's size. (figure: a CF-tree with CF_1, CF_2, ..., CF_k at the root level and CF_11, CF_12, ..., CF_1k at the first level)

124 Outline of BIRCH BIRCH applies a multiphase clustering technique: A single scan of the data set yields a basic, good clustering, and one or more additional scans can optionally be used to further improve the quality

125 CURE - another hierarchical approach CURE (Clustering Using Representatives) represents each cluster with multiple representative points from the cluster: the first point is the one farthest from the center, and each subsequent point is chosen as far away as possible from the previously selected set. Uses a variety of techniques to handle large data sets, outliers and clusters with non-spherical shapes and non-uniform sizes. Applies agglomerative hierarchical clustering for the actual clustering into a desired number of clusters specified by the user. Sampling and partitioning are applied to improve computational performance: CURE partitions a random sample of the data points and performs hierarchical clustering on each partition. The clusters found in this initial clustering are then clustered further into the final number of desired clusters. A final pass assigns each remaining point in the data set to the closest cluster, i.e. the cluster with the closest representative point

126 CURE: another hierarchical approach Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster by a shrinkage factor (typically between 0.2 and 0.7). Cluster similarity is the similarity of the closest pair of representative points from different clusters Shrinking representative points toward the center helps avoid problems with noise and outliers CURE is able to handle clusters of arbitrary shapes and sizes through the representation of the clusters by multiple representative points
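A sketch of how CURE-style representative points can be selected and shrunk toward the centroid, following the description above (the function name and defaults are my own):

```python
import numpy as np

def cure_representatives(points, n_rep=5, shrink=0.3):
    """Pick well-scattered representative points and shrink them toward the centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # first representative: the point farthest from the cluster center
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(points)):
        # next representative: the point farthest from the current set of representatives
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)   # shrink toward the centroid (factor ~0.2-0.7)
```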

127 Outline of the CURE algorithm (the algorithm outline is given in the figure), where m is the number of data points, p the number of partitions, q the desired reduction of points in a partition (i.e. the number of clusters in a partition is m/(pq)), and K is the desired number of clusters

128 Experimental Results: CURE Picture from CURE, Guha, Rastogi, Shim

129 Experimental Results: CURE (centroid) (single link) Picture from CURE, Guha, Rastogi, Shim


Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1 Hierarchy An arrangement or classification of things according to inclusiveness A natural way of abstraction, summarization, compression, and simplification for understanding Typical setting: organize

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis 7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining 数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Cluster Analysis Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter 8.4,8.5,9.2.2, 9.3 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber &

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling To Appear in the IEEE Computer CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling George Karypis Eui-Hong (Sam) Han Vipin Kumar Department of Computer Science and Engineering University

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Clustering Algorithms for Spatial Databases: A Survey

Clustering Algorithms for Spatial Databases: A Survey Clustering Algorithms for Spatial Databases: A Survey Erica Kolatch Department of Computer Science University of Maryland, College Park CMSC 725 3/25/01 kolatch@cs.umd.edu 1. Introduction Spatial Database

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Hierarchical Clustering Produces a set

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

Clustering Lecture 3: Hierarchical Methods

Clustering Lecture 3: Hierarchical Methods Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analysis: Basic Concepts and Algorithms Data Warehousing and Mining Lecture 10 by Hossen Asiful Mustafa What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

TRANSACTIONAL CLUSTERING. Anna Monreale University of Pisa

TRANSACTIONAL CLUSTERING. Anna Monreale University of Pisa TRANSACTIONAL CLUSTERING Anna Monreale University of Pisa Clustering Clustering : Grouping of objects into different sets, or more precisely, the partitioning of a data set into subsets (clusters), so

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

CSE 347/447: DATA MINING

CSE 347/447: DATA MINING CSE 347/447: DATA MINING Lecture 6: Clustering II W. Teal Lehigh University CSE 347/447, Fall 2016 Hierarchical Clustering Definition Produces a set of nested clusters organized as a hierarchical tree

More information

Lecture 10: Semantic Segmentation and Clustering

Lecture 10: Semantic Segmentation and Clustering Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305

More information

A Comparative Study of Various Clustering Algorithms in Data Mining

A Comparative Study of Various Clustering Algorithms in Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1 Contents List of Figures List of Tables List of Algorithms Preface xiii xv xvii xix I Clustering, Data, and Similarity Measures 1 1 Data Clustering 3 1.1 Definition of Data Clustering... 3 1.2 The Vocabulary

More information