DATA MINING II - 1DL460


1 DATA MINING II - 1DL460 Spring 2017 A second course in data mining Kjell Orsborn, Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden

2 Cluster validation (Tan, Steinbach, Kumar ch. 8.5) Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

3 Cluster validity For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is: how do we evaluate the goodness of the resulting clusters? But clusters are in the eye of the beholder! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.

4 Clusters found in random data (figures: a set of random points, and the clusters found in them by DBSCAN, K-means and complete link)

5 Different aspects of cluster validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information - use only the data.
4. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
5. Comparing the results of two different sets of cluster analyses to determine which is better.
Note: 1, 2 and 3 make no use of external information, while 4 requires external information. 5 may or may not use external information. For 3, 4 and 5, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters

6 Measures of cluster validity Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types. Internal index (unsupervised evaluation): used to measure the goodness of a clustering structure without respect to external information, e.g. the Sum of Squared Errors (SSE). External index (supervised evaluation): used to measure the extent to which cluster labels match externally supplied class labels, e.g. entropy. Relative index: used to compare two different clusterings or clusters; often an external or internal index is used for this function, e.g. SSE or entropy. Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" is the general strategy and "index" is the numerical measure that implements the criterion

7 Internal measures: measuring cluster validity via correlation Two matrices: P is the proximity matrix, where P(i,j) = d(x_i, x_j), i.e. the proximity/similarity between x_i and x_j. Q is the incidence matrix, with one row and one column for each data point: an entry Q(i,j) is 1 if the associated pair of points belongs to the same cluster, and 0 if the pair belongs to different clusters. Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated. The modified Hubert Γ statistic calculates this correlation between P and Q: Γ = (1/M) * sum_{i=1}^{n-1} sum_{j=i+1}^{n} P(i,j) Q(i,j), where M = n(n-1)/2 is the number of pairs of distinct points. High correlation indicates that points that belong to the same cluster are close to each other. Works best for globular clusters, but is not a good measure for some density- or contiguity-based clusters
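As a rough illustration (not from the course material), the correlation between the proximity matrix and the cluster incidence matrix can be computed directly with NumPy/SciPy; the function name below is my own.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def incidence_proximity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the cluster incidence matrix.

    X      : (n, d) array of data points
    labels : length-n array of cluster labels
    """
    labels = np.asarray(labels)
    n = len(labels)
    P = squareform(pdist(X))                                 # proximity matrix (Euclidean distances)
    Q = (labels[:, None] == labels[None, :]).astype(float)   # 1 if the pair is in the same cluster
    iu = np.triu_indices(n, k=1)                             # only the n(n-1)/2 entries above the diagonal
    return np.corrcoef(P[iu], Q[iu])[0, 1]
```

Note that with a distance-based proximity matrix, well-separated clusters give a strongly negative correlation; with a similarity matrix the correlation would instead be strongly positive.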

8 Internal measures: measuring cluster validity via correlation (figure: correlation of incidence and proximity matrices for the K-means clusterings of two data sets, one with well-separated clusters and one with random points)

9 Using similarity matrix for cluster validation Order the similarity matrix with respect to cluster labels and inspect visually. Sampling can be applied to large data sets. (figure: well-separated clusters and the corresponding reordered similarity matrix)

10 Using similarity matrix for cluster validation Clusters in random data are not so crisp (figure: random points clustered by DBSCAN and the corresponding reordered similarity matrix)

11 Using similarity matrix for cluster validation Clusters in random data are not so crisp (figure: random points clustered by K-means and the corresponding reordered similarity matrix)

12 Using similarity matrix for cluster validation Clusters in random data are not so crisp (figure: random points clustered by complete link and the corresponding reordered similarity matrix)

13 Using similarity matrix for cluster validation Clusters of non-globular shape are not as clearly separated in the similarity matrix (figure: DBSCAN clusters and the corresponding reordered similarity matrix)

14 Internal measures: SSE Internal index: used to measure the goodness of a clustering structure without respect to external information. SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters (figure: SSE plotted against K, showing an elbow at the natural number of clusters)

15 Internal measures: SSE Clusters in more complicated figures are sometimes not so well separated (figures: an SSE curve for a more complicated data set where elbows are not as clearly identified, and the SSE of clusters found using K-means)

16 Unsupervised cluster validity measure More generally, given K clusters: overall validity = sum_{i=1}^{K} w_i * validity(C_i), where validity(C_i) is a function of cohesion, separation, or both, and w_i is a weight associated with each cluster i. For SSE: w_i = 1 and validity(C_i) = sum_{x in C_i} (x - mu_i)^2

17 Internal measures: Cohesion and Separation Cluster cohesion: measures how closely related objects are in a cluster Cluster separation: measures how distinct or well-separated a cluster is from other clusters

18 Internal measures: Cohesion and Separation A proximity graph-based approach can be used for cohesion and separation. Cluster cohesion is the sum of the weights of all links within a cluster. Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster. (figure: cohesion and separation illustrated on a proximity graph)

19 Internal measures: Cohesion and Separation A prototype-based approach can also be used for cohesion and separation. Cluster cohesion is the sum of the proximities with respect to the prototype (centroid or medoid) of the cluster. Cluster separation is measured by the proximity of the cluster prototypes. (figure: cohesion and separation illustrated with cluster prototypes)

20 Graph-based versus Prototype-based views

21 Graph-based view Cluster cohesion: measures how closely related objects are in a cluster: Cohesion(C_i) = sum_{x in C_i, y in C_i} proximity(x, y). Cluster separation: measures how distinct or well-separated a cluster is from other clusters: Separation(C_i, C_j) = sum_{x in C_i, y in C_j} proximity(x, y)

22 Prototype-based view Cluster cohesion: Cohesion(C_i) = sum_{x in C_i} proximity(x, c_i), which is equivalent to SSE if proximity is the square of the Euclidean distance. Cluster separation: Separation(C_i, C_j) = proximity(c_i, c_j), or Separation(C_i) = proximity(c_i, c), where c_i and c_j are the cluster prototypes and c is the overall prototype
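A small sketch of the prototype-based view, assuming squared Euclidean distance as the proximity measure (the function name is my own):

```python
import numpy as np

def prototype_cohesion_separation(X, labels):
    """Prototype-based cohesion per cluster and separation between cluster pairs,
    using squared Euclidean distance as the proximity measure."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = {k: X[labels == k].mean(axis=0) for k in ids}
    # cohesion(C_i) = sum over x in C_i of ||x - c_i||^2  (equals the cluster's SSE)
    cohesion = {k: ((X[labels == k] - centroids[k]) ** 2).sum() for k in ids}
    # separation(C_i, C_j) = ||c_i - c_j||^2
    separation = {(i, j): ((centroids[i] - centroids[j]) ** 2).sum()
                  for i in ids for j in ids if i < j}
    return cohesion, separation
```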

23 Unsupervised cluster validity measures

24 Prototype-based vs Graph-based cohesion It can be shown that the prototype-based and graph-based approaches are equivalent for some proximity measures: for SSE and points in Euclidean space, the SSE of a cluster is proportional to the sum of pairwise squared distances between its points (SSE_i = (1/(2 m_i)) * sum over x, y in C_i of dist(x, y)^2)

25 Total Sum of Squares (TSS) When using Euclidean distance as the proximity measure, the traditional measure of separation between clusters is the between-group sum of squares (SSB). Notation: c is the overall mean, c_i is the centroid of cluster C_i, and m_i is the number of points in cluster C_i. (figure: clusters with centroids c_1, c_2, c_3 and overall mean c)
TSS = sum_{x} dist(x, c)^2
SSE = sum_{i=1}^{K} sum_{x in C_i} dist(x, c_i)^2
SSB = sum_{i=1}^{K} m_i * dist(c_i, c)^2

26 Total Sum of Squares (TSS) SSE: the total sum of squared errors, i.e. SSE summed over all clusters. SSB: the between-group sum of squares summed over all clusters. It can be shown that the total sum of squares (TSS) is constant, so given a data set, TSS is fixed: TSS = SSE + SSB. A clustering with large SSE has small SSB, while one with small SSE has large SSB. The goal is to minimize SSE and maximize SSB
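The identity TSS = SSE + SSB can be checked numerically with a few lines of NumPy; this sketch and its function name are illustrative only.

```python
import numpy as np

def tss_sse_ssb(X, labels):
    """Compute TSS, SSE and SSB for a partition of the data (TSS = SSE + SSB)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    c = X.mean(axis=0)                              # overall mean
    tss = ((X - c) ** 2).sum()
    sse = ssb = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        ck = pts.mean(axis=0)                       # cluster centroid
        sse += ((pts - ck) ** 2).sum()
        ssb += len(pts) * ((ck - c) ** 2).sum()
    return tss, sse, ssb

# A small 1-D example: points 1, 2, 4, 5 with clusters {1, 2} and {4, 5}
print(tss_sse_ssb(np.array([[1.], [2.], [4.], [5.]]), np.array([0, 0, 1, 1])))
# -> (10.0, 1.0, 9.0), i.e. TSS = SSE + SSB
```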

27 Total Sum of Squares (TSS) Example: four points 1, 2, 4 and 5 on a line, with overall mean c = 3.
K=1 cluster (centroid 3): SSE = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10; SSB = 4*(3-3)^2 = 0; TSS = SSE + SSB = 10.
K=2 clusters ({1, 2} with centroid 1.5 and {4, 5} with centroid 4.5): SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1; SSB = 2*(1.5-3)^2 + 2*(4.5-3)^2 = 9; TSS = SSE + SSB = 10.
TSS is the same (10) in both cases.

28 Internal Measures: Silhouette coefficient The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings. For an individual point i: calculate a_i = the average distance of i to the points in its own cluster; calculate b_i = the minimum, over the other clusters, of the average distance of i to the points in that cluster. The silhouette coefficient of the point is then s_i = (b_i - a_i) / max(a_i, b_i), where s_i is in [-1, 1] (simplified expression: s_i = 1 - a_i/b_i if a_i < b_i, or s_i = b_i/a_i - 1 if a_i >= b_i, which is not the usual case). Typically between 0 and 1; the closer to 1 the better. The average silhouette width can be calculated for a cluster or for an entire clustering
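A direct (unoptimized) sketch of the per-point silhouette computation, assuming Euclidean distance; scikit-learn's silhouette_score provides an equivalent, optimized version.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def silhouette_values(X, labels):
    """Per-point silhouette coefficients s_i = (b_i - a_i) / max(a_i, b_i)."""
    D = squareform(pdist(X))
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = (labels == labels[i])
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0                     # cohesion term
        b = min(D[i, labels == k].mean()
                for k in np.unique(labels) if k != labels[i])            # separation term
        s[i] = (b - a) / max(a, b)
    return s  # averaging gives the silhouette width of a cluster or a clustering
```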

29 Example silhouette coefficient

30 Correct number of clusters? Looking visually for knees, peaks or dips in a plot of an evaluation measure against the number of clusters can indicate the natural number of clusters. Does not always work well, e.g. when clusters are overlapping, intertwined or not clearly separated. (figures: a distinct knee in the SSE curve and a distinct peak in the silhouette coefficient at the natural number of clusters)

31 Clustering tendency Methods for evaluating whether the data has clusters, without actually clustering. The most common approach in Euclidean space is to apply a statistical test for spatial randomness; it can be quite challenging to select the correct model and parameters. Example: the Hopkins statistic (see blackboard example)
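One common formulation of the Hopkins statistic is sketched below; conventions vary between textbooks, so treat this as an illustration rather than the exact definition used in the course.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=None, rng=None):
    """Hopkins statistic sketch: with this convention, values near 0.5 indicate
    random (structureless) data and values close to 1 indicate clustered data."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    m = m or max(1, len(X) // 10)                        # size of the sample
    tree = cKDTree(X)
    # u: distances from m uniformly random points in the bounding box to the nearest data point
    rand_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u = tree.query(rand_pts, k=1)[0]
    # w: distances from m sampled data points to their nearest other data point
    sample = X[rng.choice(len(X), size=m, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]
    return u.sum() / (u.sum() + w.sum())
```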

32 Unsupervised evaluation of hierarchical clustering CoPhenetic Correlation Coefficient (CPCC) (figure: an example distance matrix and the corresponding single-link dendrogram)

33 Unsupervised evaluation of hierarchical clustering CoPhenetic Correlation Coefficient (CPCC) (figure: the single-link dendrogram and the cophenetic distance matrix for single link) CPCC is the correlation between the original distance matrix and the cophenetic distance matrix. The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. CPCC is a standard measure of how well a hierarchical clustering fits the data. CPCC is different for different types of hierarchical clusterings
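CPCC can be computed with SciPy's hierarchical clustering utilities; the toy data below is just a placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(30, 2)            # placeholder data
d = pdist(X)                         # original condensed distance matrix
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)    # hierarchical clustering
    cpcc, coph_d = cophenet(Z, d)    # correlation with the cophenetic distances
    print(method, round(cpcc, 3))    # CPCC differs between linkage methods
```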

34 Unsupervised evaluation of hierarchical clustering CoPhenetic Correlation Coefficient (CPCC) (figures: single-link and complete-link dendrograms, comparing CPCC for different clustering techniques)

35 Supervised measures of cluster validity: classification-oriented measures of cluster validity Entropy: the degree to which each cluster consists of objects of a single class. Purity: another measure of the extent to which a cluster contains objects of a single class. Precision: the fraction of a cluster that consists of objects of a specific class. Recall: the extent to which a cluster contains all objects of a specified class. F-measure: a combination of precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class

36 External measures of cluster validity: Entropy and Purity
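A small sketch of entropy and purity computed from a cluster-by-class contingency matrix (the function name is my own):

```python
import numpy as np

def entropy_and_purity(contingency):
    """Overall entropy and purity from a contingency matrix,
    where contingency[i, j] = number of objects of class j in cluster i."""
    C = np.asarray(contingency, dtype=float)
    m = C.sum()
    sizes = C.sum(axis=1)                                   # objects per cluster
    p = C / sizes[:, None]                                  # class distribution within each cluster
    cluster_entropy = -(p * np.log2(np.where(p > 0, p, 1.0))).sum(axis=1)
    cluster_purity = p.max(axis=1)
    w = sizes / m                                           # weight clusters by their size
    return w @ cluster_entropy, w @ cluster_purity          # overall entropy, overall purity
```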

37 Supervised measures of cluster validity: similarity-oriented measures of cluster validity A similarity-oriented measure of cluster validity is based on the idea that any two objects in the same cluster should be in the same class. Hence, these measure the extent to which two objects of the same class are also in the same cluster, and vice versa. Can be expressed as comparing two matrices: the ideal cluster similarity matrix, which has a 1 in the ij-th entry if objects i and j are in the same cluster and 0 otherwise, and the ideal class similarity matrix, defined with respect to class labels, which has a 1 in the ij-th entry if objects i and j belong to the same class and 0 otherwise. The correlation between these two matrices, called the Γ (gamma) statistic, can be taken as a measure of cluster validity

38 Supervised measures of cluster validity: similarity-oriented measures of cluster validity Correlation between cluster and class matrices. Five data points: p1, p2, p3, p4 and p5. Two clusters: C1 = {p1, p2, p3} and C2 = {p4, p5}. Two classes: L1 = {p1, p2} and L2 = {p3, p4, p5}. The correlation between these matrices is 0.359 according to the Tan book (it is unclear what formula is used there); our own calculation gives 0.1667 (see notes on blackboard)

39 Supervised measures of cluster validity: similarity-oriented measures of cluster validity Cluster validity measures based on the contingency table:

                   Same cluster   Different cluster
  Same class       f11            f10
  Different class  f01            f00

f11: number of pairs of objects having the same class and the same cluster
f01: number of pairs of objects having a different class and the same cluster
f10: number of pairs of objects having the same class and a different cluster
f00: number of pairs of objects having a different class and a different cluster

40 Supervised measures of cluster validity: similarity-oriented measures of cluster validity For the example above, the contingency table is:

                   Same cluster   Different cluster
  Same class       f11 = 2        f10 = 2
  Different class  f01 = 2        f00 = 4

Two frequently used cluster validity measures based on these counts are the Rand statistic and the Jaccard coefficient (see example on blackboard)
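The pair counts, Rand statistic and Jaccard coefficient for the slide's example can be computed as in the sketch below, using Rand = (f11 + f00) / (f11 + f10 + f01 + f00) and Jaccard = f11 / (f11 + f10 + f01).

```python
from itertools import combinations

def pair_counts(cluster_labels, class_labels):
    """Count object pairs by same/different cluster and same/different class."""
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_class and same_cluster:       f11 += 1
        elif same_class and not same_cluster: f10 += 1
        elif not same_class and same_cluster: f01 += 1
        else:                                 f00 += 1
    return f11, f10, f01, f00

# Example from the slide: C1 = {p1,p2,p3}, C2 = {p4,p5}; L1 = {p1,p2}, L2 = {p3,p4,p5}
clusters = [1, 1, 1, 2, 2]
classes  = [1, 1, 2, 2, 2]
f11, f10, f01, f00 = pair_counts(clusters, classes)
rand    = (f11 + f00) / (f11 + f10 + f01 + f00)   # = 6/10 = 0.6
jaccard = f11 / (f11 + f10 + f01)                 # = 2/6 ~ 0.33
```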

41 Supervised measures of cluster validity: cluster validity for hierarchical clusterings Supervised evaluation of hierarchical clustering is more difficult, e.g. a preexisting hierarchical structure might be hard to find. The hierarchical F-measure evaluates a hierarchical clustering with respect to a flat set of class labels

42 Supervised measures of cluster validity: cluster validity for hierarchical clusterings The idea of the hierarchical F-measure is to evaluate whether the hierarchical clustering, for each class, contains at least one cluster that is relatively pure and contains most objects of that class. Compute (for each class) the F-measure of each cluster in the hierarchy. Retrieve, for each class, the maximum F-measure attained over all clusters. Calculate the total F-measure as the weighted average of these per-class maxima, weighted by class size: F = sum_j (m_j / m) * max_i F(i, j), where m_j is the number of objects in class j, m is the total number of objects, and F(i, j) is the F-measure of cluster i with respect to class j
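A sketch of the hierarchical F-measure under the weighted-maximum formulation above; the data structures (clusters and classes as sets of object ids) are my own choice.

```python
def f_measure(cluster, cls_members):
    """F-measure of one cluster with respect to one class (both given as sets of object ids)."""
    tp = len(cluster & cls_members)
    if tp == 0:
        return 0.0
    precision = tp / len(cluster)
    recall = tp / len(cls_members)
    return 2 * precision * recall / (precision + recall)

def hierarchical_f_measure(hierarchy_clusters, classes):
    """hierarchy_clusters: all clusters in the tree, each a set of object ids.
       classes: dict mapping class label -> set of object ids."""
    m = sum(len(members) for members in classes.values())
    total = 0.0
    for members in classes.values():
        best = max(f_measure(c, members) for c in hierarchy_clusters)  # best cluster for this class
        total += len(members) / m * best                               # weight by class size
    return total
```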

43 Significance of cluster validity measures? How to interpret the significance of a calculated evaluation measure? Min/max values give some guidance; however, min/max values might not be available, or scale may affect interpretation. Different applications might tolerate different values. A common solution is to interpret the value of the validity measure in statistical terms. See the following examples:

44 Framework for cluster validity We need a framework to interpret any measure. For example, if our evaluation measure has the value 10, is that good, fair, or poor? Statistics provide a framework for cluster validity: the more atypical a clustering result is, the more likely it represents valid structure in the data. We can compare the values of an index obtained from random data or random clusterings to those of the actual clustering result: if the value of the index is unlikely under randomness, then the cluster results are valid. For comparing the results of two different sets of cluster analyses, a framework is less necessary; however, there is still the question of whether the difference between two index values is significant

45 Statistical Framework for SSE Example: 2-d data with 100 points (figure: three well-separated clusters). Suppose a clustering algorithm produces a given (low) SSE value. Does that mean that the clusters are statistically significant?

46 Example continued: Statistical framework for SSE Generate 500 sets of 100 random data points, distributed over the same range of x and y values as the original data. Perform clustering with k = 3 clusters for each data set. Plot the histogram of SSE values and compare with the SSE obtained for the 3 well-separated clusters. (figures: the original data and the histogram of SSE values over the 500 random data sets)
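A sketch of this reference-distribution idea using scikit-learn's KMeans; the uniform range and the number of trials are assumptions to adjust to your data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_points, n_trials, k = 100, 500, 3

# Reference distribution: SSE of k-means on random (structureless) data
random_sse = []
for _ in range(n_trials):
    X = rng.uniform(0.0, 1.0, size=(n_points, 2))            # assumed data range
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    random_sse.append(km.inertia_)                            # inertia_ is the SSE

# An observed SSE far below this distribution suggests real cluster structure
print(np.percentile(random_sse, [1, 50, 99]))
```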

47 Statistical framework for correlation Correlation of incidence and proximity matrices for the K-means clusterings of two data sets (figures: well-separated clusters and random points). The correlation for the well-separated data is statistically significant; the correlation for the random data is not

48 Final comment on cluster validity The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage. Algorithms for Clustering Data, Jain and Dubes

49 Alt. Clustering Techniques Cluster analysis: additional issues and algorithms (Tan, Steinbach, Kumar ch. 9) Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden

50 Characteristics of data, clusters and clustering algorithms No algorithm is suitable for all types of data, clusters, and applications. The characteristics of data, clusters, and algorithms strongly impact clustering, and are important for understanding, describing and comparing clustering techniques

51 Characteristics of data
High dimensionality: problems with density and proximity measures. Dimensionality reduction techniques are one approach to address this problem; redefinition of proximity and density is another (sections 9.4.5, 9.4.7).
Size: many algorithms have time or space complexity of O(m^2), m being the number of objects (section 9.5).
Sparseness: what type of sparse data? Asymmetric attributes? Is the magnitude important, or just occurrences?
Noise and outliers: how do noise and outliers affect a specific algorithm? (dealt with in DBSCAN, Chameleon (section 9.4.4), SNN density (9.4.8), CURE (9.5.3))
Types of attributes and data set: different proximity and density measures are needed for different types of data (ch 2).
Scale: normalization can be required (ch 2.3.7).
Mathematical properties of the data space: e.g. is a mean or a density meaningful for your data?

52 Characteristics of clusters
Data distribution: some clustering algorithms assume a particular type of distribution of the data, e.g. mixture models (9.2.2).
Shape: can a specific algorithm handle the expected cluster shapes? (DBSCAN, single-link, Chameleon, CURE)
Differing sizes: can a specific algorithm handle varying cluster sizes? (section 9.6)
Differing densities: can a specific algorithm handle varying cluster densities? (SNN density)
Poorly separated clusters: can a specific algorithm handle overlapping clusters? (fuzzy clustering, 9.2.1)
Relationships among clusters: e.g. self-organizing maps (SOM) consider the relationships between clusters.
Subspace clusters: clusters can exist in many distinct subspaces (subspace clustering - DENCLUE)

53 Characteristics of clustering algorithms
Order dependence: the quality and number of clusters can be order dependent, e.g. SOM.
Nondeterminism: e.g. K-means produces different results for every execution.
Scalability: required for large data sets; linear or near-linear time and space complexity is preferred, and O(m^2) is less practical for large data sets.
Parameter selection: it can be challenging to set parameter values.
Transforming the clustering problem to another domain: e.g. graph partitioning.
Treating clustering as an optimization problem: exhaustive search is computationally infeasible; in practice one needs heuristics that produce good (but maybe not optimal) results

54 Types of clustering algorithms One way of dividing clustering algorithms: Prototype-based clustering algorithms Density-based clustering algorithms Graph-based clustering algorithms Scalable clustering algorithms

55 Prototype-based clustering algorithms K-means centroid-based Fuzzy clustering Objects are allowed to belong to more than one cluster Relies on fuzzy set theory that allows an object to belong to a set with a degree of membership between 0 and 1. Applied in e.g. Fuzzy c-means algorithm (fuzzy version of K-means) Mixture models clustering Clusters modeled as statistical distributions EM (expectation-maximization) algorithm Self-organizing maps (SOM) Clusters constrained to have fixed relationships to neighbours e.g. two-dimensional grid structure

56 Example of Fuzzy c-means clustering

57 Fuzzy c-means clustering pros and cons Does not restrict data points to belong to one cluster; instead it indicates the degree to which a point belongs to each cluster. Also shares some strengths and weaknesses of K-means (e.g. problems with non-globular shapes), although it is more computationally intensive
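A minimal fuzzy c-means sketch, assuming Euclidean distance and the standard membership update u_ij proportional to d_ij^(-2/(m-1)); the function name and parameter defaults are my own.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100):
    """Minimal fuzzy c-means: soft memberships U and centroids updated alternately."""
    rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    U = rng.dirichlet(np.ones(c), size=len(X))               # membership degrees, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]         # fuzzily weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)              # standard FCM membership update
    return centers, U
```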

58 Mixture models clustering Clusters modeled as statistical distributions Exemplified by EM (expectation-maximization) algorithm

59 EM (expectation-maximization) algorithm E-step: prob(distribution j | x_i, Θ) is the probability that a point x_i came from a particular (commonly Gaussian) distribution j, where 1 <= j <= k and Θ_j = (μ_j, σ_j), calculated using the current parameter estimates Θ:
prob(distribution j | x_i, Θ) = prob(x_i | Θ_j) / sum_{l=1}^{k} prob(x_i | Θ_l)
M-step: the parameters Θ are re-estimated, over all n data points, by:
μ_j = sum_{i=1}^{n} x_i prob(distribution j | x_i, Θ) / sum_{i=1}^{n} prob(distribution j | x_i, Θ)
σ_j^2 = sum_{i=1}^{n} prob(distribution j | x_i, Θ) (x_i - μ_j)^2 / sum_{i=1}^{n} prob(distribution j | x_i, Θ)

60 EM (expectation-maximization) algorithm prob(distribution j | x_i, Θ) is the probability that a point x_i came from a particular (commonly Gaussian) distribution j, calculated using the estimated parameters Θ. Compare K-means, where the centroids are updated in each step in a similar alternating fashion
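A minimal 1-D EM sketch for a mixture of k Gaussians with equal mixing weights, matching the E- and M-steps above; the initialization and iteration count are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def em_1d(x, k=2, n_iter=100):
    """Minimal 1-D EM for a mixture of k Gaussians with equal mixing weights."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=k, replace=False)        # initial means
    sigma = np.full(k, x.std())                      # initial standard deviations
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.array([norm.pdf(x, mu[j], sigma[j]) for j in range(k)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters from the weighted points
        w = resp.sum(axis=1)
        mu = (resp @ x) / w
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / w)
    return mu, sigma

x = np.concatenate([np.random.normal(-4, 1, 200), np.random.normal(3, 0.5, 200)])
print(em_1d(x))
```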

61 EM clustering example

62 Additional EM clustering examples

63 EM clustering pros and cons Pros: more general than K-means, since it can use various types of distributions; can find clusters of different sizes and elliptical shapes; gives a condensed description of the clusters through a small number of parameters. Cons: can be slow for models including a large number of components; does not work well for a low number of data points; problems with estimating the number of clusters or, more generally, with choosing the exact form of the model; noise and outliers might cause problems

64 Self organizing maps (SOM) The goal of SOM is to find a set of centroids (reference vectors in SOM terminology) and to assign each data point to the centroid that provides the best approximation of that data point. In self-organizing maps (SOM), the centroids have a predetermined topographic ordering relationship. During the training process, SOM uses each data point to update the closest centroid and the nearby centroids in the topographic ordering. For example, the centroids of a two-dimensional SOM can be viewed as a structure on a 2D surface that tries to fit the n-dimensional data as well as possible

65 Kohonen self organizing maps Similarity between centroid nodes and input nodes: input X = <x_1, ..., x_h>; weights of centroid node i: <w_1i, ..., w_hi>. Similarity is defined based on the dot product: sim(X, i) = sum_{j=1}^{h} x_j w_ji. The centroid node i most similar to the input wins. The weights of the winning node (as well as of surrounding nodes) are increased(*): Δw_kj = c (x_k - w_kj) if node j is in N_i, and 0 otherwise, where c indicates the learning rate and N_i is the union of centroid i and its neighborhood centroids. (*) an alternative formulation is that the centroid and its neighborhood move closer to the current input node
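A minimal SOM training sketch; note that it matches the winner by smallest Euclidean distance rather than the dot-product form on the slide, and the grid size, learning rate and neighbourhood radius are arbitrary choices.

```python
import numpy as np

def som_train(X, grid=(5, 5), n_iter=1000, lr=0.5, radius=1.0):
    """Minimal SOM sketch: centroids on a 2-D grid; the winner and its topographic
    neighbours move toward each presented input."""
    rng = np.random.default_rng(0)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])   # grid positions
    W = rng.uniform(X.min(0), X.max(0), size=(rows * cols, X.shape[1]))     # reference vectors
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        winner = np.argmin(((W - x) ** 2).sum(axis=1))        # best-matching unit
        grid_dist = np.abs(coords - coords[winner]).sum(axis=1)
        neighbours = grid_dist <= radius                      # winner + topographic neighbours
        W[neighbours] += lr * (1 - t / n_iter) * (x - W[neighbours])   # decaying learning rate
    return W.reshape(rows, cols, -1)
```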

66 Example of SOM application In self-organizing maps, clusters are constrained to have fixed relationships to neighbours

67 Example of SOM application In self-organizing maps, clusters are constrained to have fixed relationships to neighbours

68 Illustration of a self organizing map (SOM) Illustration of SOM network where black dots are data and green dots represent SOM nodes (source:

69 Self organizing maps pros and cons Pros: related clusters are close to each other, which facilitates the interpretation and visualization of clustering results. Cons: requires setting several parameters (neighborhood function, grid type, number of centroids); a SOM cluster might not correspond to a single natural cluster, since it can encompass several natural clusters, or a single natural cluster might span several centroids; lacks an objective function, which can make it problematic to compare different SOM clustering results; convergence can be slow and is not guaranteed, although it is normally achieved

70 Density-based clustering algorithms (1) DBSCAN Core points, border points and noise Grid-based clustering Objects are assigned to grid cells corresponding to attribute intervals Subspace clustering There can be different clusters in different subspaces

71 Density-based clustering algorithms (2) Grid-based clustering:

72 Grid-based clustering pros and cons Pros: grid-based clustering can be very efficient and effective, with O(m) complexity (m being the number of points) for defining the grid and an overall complexity of O(m log m). Cons: dependent on the choice of density threshold (too high and clusters will be lost; too low and separate clusters may be joined); differing densities might make it hard to find a single threshold that works for the complete data space; a rectangular grid may have problems accurately capturing circular boundaries; grid-based clustering tends to work poorly for high-dimensional data

73 Density-based clustering algorithms (3) Subspace clustering: Subspace clustering considers subsets of features Clusters found in subspaces can be quite different Data may be clustered with respect to a small set of attributes but random for others Different clusters might exist in different sets of dimensions

74 Density-based clustering algorithms (4) CLIQUE (Clustering In QUEst) Grid-based clustering that methodically finds subspace clusters. The number of subspaces is exponential in the number of dimensions, so an efficient pruning technique is needed. CLIQUE relies on the monotonicity property of density-based clusters: if a set of points forms a density-based cluster in k dimensions, then the same points are also part of a density-based cluster in all possible subsets of those dimensions. Based on the well-known Apriori principle from association analysis

75 Density-based clustering algorithms (5) CLIQUE pros and cons Well understood, since it is based on the well-known Apriori principle. Can summarize the list of cells in clusters with a small set of inequalities. Clusters can overlap, which can make interpretation difficult. Potential exponential time complexity: too many cells might be generated for lower values of k, i.e. for low dimensions

76 Density-based clustering algorithms (6) DENCLUE (density clustering) Models the overall density of a set of points as the sum of the influence (or kernel) functions associated with each point Based on kernel density estimation, a well-developed area of statistics. Peaks (local density attractors) forms the basis for forming clusters A minimum density threshold, ξ, separates clusters from data points considered to be noise Grid-based implementation defines neighborhood and reduces complexity

77 Density-based clustering algorithms (7) DENCLUE:

78 Density-based clustering algorithms (8) DENCLUE pros and cons Solid theoretical foundation. Good at handling noise and outliers and at finding clusters of different shapes and sizes. Can be computationally expensive. Accuracy is sensitive to the choice of grid size (in the grid-based implementation). Problems with high dimensionality and differing densities

79 Graph-based clustering algorithms (1) In graph-based clustering, data objects are represented as nodes and proximity as the edge weight between the two corresponding nodes. Agglomerative (hierarchical) clustering algorithms were discussed in the DM1 course. Here some more advanced graph-based clustering algorithms are presented that apply different subsets of the following key approaches: Sparsification of the proximity graph, keeping only connections of an object to its nearest neighbours - useful for handling noise and outliers, and makes it possible to apply efficient graph partitioning algorithms. Using information in the proximity graph for more sophisticated merging strategies, e.g. whether a merged cluster will have similar characteristics to the original unmerged clusters. Similarity measures between objects based on the number of shared nearest neighbours - useful for handling problems with high dimensionality and varying density. Defining core points as a basis for cluster generation - requires a density-based concept for (sparsified) proximity graphs, and is useful for handling differing shapes and sizes

80 Graph-based clustering algorithms (2) Graph-based clustering uses the proximity graph Start with the proximity matrix Consider each point as a node in a graph Each edge between two nodes has a weight which is the proximity between the two points Initially the proximity graph is fully connected MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph In the simplest case, clusters are connected components in the graph

81 Clustering using sparsification Sparsification - based on the observation that for most data sets objects are highly similar to a small no of objects and weakly similar to most other objects. Sparsification can be made by either removing links that have a similarity below a specified threshold or by keeping links to the k nearest neighbours

82 Clustering using sparsification The amount of data that needs to be processed is drastically reduced Sparsification can eliminate more than 99% of the entries in a proximity matrix The amount of time required to cluster the data is drastically reduced The size of the problems that can be handled is increased Clustering may work better Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points. The nearest neighbors of a point tend to belong to the same class as the point itself. This reduces the impact of noise and outliers and sharpens the distinction between clusters. Sparsification facilitates the use of graph partitioning algorithms E.g. applied in Opossum and Chameleon
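A sketch of k-nearest-neighbour sparsification of a similarity matrix (the function name is my own; note the result is not necessarily symmetric):

```python
import numpy as np

def sparsify_knn(S, k):
    """Keep only each point's k strongest links in an n x n similarity matrix S."""
    n = len(S)
    sparse = np.zeros_like(S)
    for i in range(n):
        order = np.argsort(S[i])[::-1]            # neighbours by decreasing similarity
        keep = [j for j in order if j != i][:k]   # the k most similar other points
        sparse[i, keep] = S[i, keep]
    return sparse
```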

83 Graph-based clustering algorithms (3) MST (minimum spanning tree) clustering and OPOSSUM (optimal partitioning of sparse similarities using Metis) Both MST and OPOSSUM rely solely on sparsification. MST is equivalent to single-link clustering and was treated in the previous course. MST (Minimum Spanning Tree): build the MST, then create clusters by partitioning the tree, removing edges in order of decreasing distance, until clusters consist of single nodes. OPOSSUM uses the well-known graph partitioning algorithm METIS: 1) compute a sparsified similarity graph, 2) partition the graph into k clusters using METIS. OPOSSUM pros and cons: simple and fast, and handles sparse high-dimensional data; produces roughly equal-sized clusters, which can be good or bad, since enforcing equal-sized clusters means natural clusters can be broken or combined

84 MST: Divisive Hierarchical Clustering Build MST (Minimum Spanning Tree) Start with a tree that consists of any point In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not Add q to the tree and put an edge between p and q

85 MST and OPOSSUM MST - use a minimum spanning tree for constructing hierarchy of clusters OPOSSUM partition a sparsified graph using the METIS algorithm

86 Graph-based clustering algorithms (4) Chameleon Hierarchical clustering with dynamic modeling Combines an initial partitioning of data and a novel hierarchical clustering scheme Use sparsification and the concept of self similarity Jarvis-Patrick and ROCK Based on shared-nearest neighbour (SNN) similarity SNN density-based clustering algorithm Based on shared-nearest neighbour density

87 Limitations of Current Cluster Merging Schemes Some existing cluster merging schemes in hierarchical clustering algorithms focus on either closeness or connectivity. MIN (single-link) or CURE: merge two clusters based on their closeness (or minimum distance) GROUP-AVERAGE: merge two clusters based on their average connectivity Using only one of these approaches can lead to mistakes in forming clusters Many cluster techniques also have a global (static) model for clusters. K-means globular clusters DBSCAN clusters defined by single density threshold These schemes have problems handling clusters of widely varying sizes, shapes, and densities

88 Limitations of Current Cluster Merging Schemes (a) (b) (c) (d) Closeness schemes will merge (a) and (b) Average connectivity schemes will merge (c) and (d)

89 Chameleon: Clustering Using Dynamic Modeling Adapt to the characteristics of the data set to find the natural clusters Use a dynamic model to measure the similarity between clusters Main property is the relative closeness and relative inter-connectivity of the cluster Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters The merging scheme preserves self-similarity One of the areas of application is spatial data

90 Characteristics of spatial data sets Clusters are defined as densely populated regions of the space Clusters have arbitrary shapes, orientation, and nonuniform sizes Difference in densities across clusters and variation in density within clusters Existence of special artifacts (streaks) and noise The clustering algorithm must address the above characteristics and also require minimal supervision

91 Chameleon: steps Preprocessing Step: Represent the data by a graph Given a set of points, construct the k-nearest-neighbor (k-nn) graph to capture the relationship between a point and its k nearest neighbors Concept of neighborhood is captured dynamically (even if region is sparse) Phase 1: Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices Each cluster should contain mostly points from one true cluster, i.e., is a sub-cluster of a real cluster

92 Chameleon: steps Phase 2: Use hierarchical agglomerative clustering to merge sub-clusters. Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters. Two key properties are used to model cluster similarity: Relative Interconnectivity: the absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters. Relative Closeness: the absolute closeness of two clusters normalized by the internal closeness of the clusters. One approach to combining RI and RC when merging clusters in Chameleon is to maximize RI(C_i, C_j) * RC(C_i, C_j)^α, where α is a user-specified parameter, typically > 1

93 Chameleon pros and cons Pros: clusters spatial data effectively even when the data include noise and outliers; handles clusters of different shapes, sizes and densities. Cons: assumes that the groups of points produced by sparsification and partitioning belong to the same true cluster, and cannot separate such objects later in the agglomerative hierarchical clustering step; i.e. Chameleon can have problems in high-dimensional space, where partitioning does not produce clean sub-clusters

94 Experimental Results: CHAMELEON

95 Experimental Results: CHAMELEON

96 Experimental Results: CHAMELEON

97 Experimental Results: CURE (10 clusters)

98 Experimental Results: CURE (15 clusters)

99 Experimental Results: CHAMELEON

100 Experimental Results: CURE (9 clusters)

101 Experimental Results: CURE (15 clusters)

102 Shared Nearest Neighbour (SNN) Approach SNN graph: the weight of an edge is the number of shared neighbours between the two vertices, given that the vertices are connected (figure: two nodes i and j and their shared neighbours)

103 Creating the SNN Graph (figures: a sparse graph, where link weights are similarities between neighboring points, and the derived shared nearest neighbor graph, where link weights are the number of shared nearest neighbors)

104 Jarvis-Patrick Clustering First, the k-nearest neighbors of all points are found. In graph terms, this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph. A pair of points is put in the same cluster if the two points share more than T neighbors and the two points are in each other's k-nearest-neighbor lists. For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors
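A naive O(n^2) sketch of Jarvis-Patrick clustering following the description above; the parameter defaults are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def jarvis_patrick(X, k=20, t=10):
    """Link two points if they are in each other's k-NN lists and share more than
    t of their k nearest neighbours; clusters are connected components of the link graph."""
    D = squareform(pdist(X))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]                    # k nearest neighbours per point
    knn_sets = [set(row) for row in nn]
    n = len(X)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in knn_sets[i]:
            if i in knn_sets[j] and len(knn_sets[i] & knn_sets[j]) > t:
                A[i, j] = A[j, i] = True                      # SNN link
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels
```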

105 Jarvis-Patrick clustering pros and cons Pros: SNN similarity means JP is good at handling noise and outliers Can handle clusters of different sizes, shapes and densities Works well in high-dimensional space, especially finding tight clusters of strongly related objects Cons: Jarvis-Patrick clustering is somewhat brittle since split/join may depend on very few links May not cluster all objects - but that can be handled separately Complexity O(m 2 ) (or O(m log m) in low-dimensional space) Finding best parameter settings can be challenging

106 When Jarvis-Patrick Works Well (figures: the original points and the Jarvis-Patrick clustering, using a threshold of 6 shared neighbors out of the nearest-neighbor list)

107 When Jarvis-Patrick does NOT work well (figures: the smallest threshold T that does not merge the natural clusters, and a threshold just below T)

108 ROCK (RObust Clustering using links) ROCK uses a number-of-shared-neighbors proximity measure similar to Jarvis-Patrick. It is a clustering algorithm for data with categorical and Boolean attributes. A pair of points is defined to be neighbors if their similarity is greater than some threshold. A hierarchical clustering scheme is then used to cluster the data: 1. Obtain a sample of points from the data set. 2. Compute the link value for each pair of points, i.e., transform the original similarities (computed by the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points. 3. Perform agglomerative hierarchical clustering on the data, using the number of shared neighbors as the similarity measure and maximizing the shared-neighbors objective function. 4. Assign the remaining points to the clusters that have been found

109 SNN Density Clustering The Shared Nearest Neighbour Density algorithm is based upon SNN similarity in combination with a DBSCAN approach SNN density measures the degree to which a point is surrounded by similar points High/low density areas tend to have high SNN density Transition areas (high-low density) tend to have low SNN density

110 SNN Density (figures: a) all points; b) high SNN density (points having 34 or more shared nearest neighbors); c) medium SNN density; d) low SNN density (points having 17 or fewer shared nearest neighbors))

111 SNN Density clustering algorithm 1. Compute the similarity matrix This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points 2. Sparsify the similarity matrix by keeping only the k most similar neighbors This corresponds to only keeping the k strongest links of the similarity graph 3. Construct the shared nearest neighbor graph from the sparsified similarity matrix. At this point, one could have applied a similarity threshold and find the connected components to obtain the clusters (i.e. Jarvis-Patrick algorithm) 4. Find the SNN density of each Point. Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point

112 SNN Density clustering algorithm 5. Find the core points Using a user specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts 6. Form clusters from the core points If two core points are within a radius, Eps, of each other they are placed in the same cluster 7. Discard all noise points All non-core points that are not within a radius of Eps of a core point are discarded 8. Assign all non-noise, non-core points to clusters This can be done by assigning such points to the nearest core point (Note that steps 4-8 are DBSCAN)

113 SNN Density Clustering can handle differing densities Original Points SNN Density Clustering

114 SNN Density Clustering can handle other difficult situations

115 Finding Clusters of Time Series In Spatio-Temporal Data (figures: SLP clusters found via shared nearest neighbor clustering (100 NN), plotted by latitude and longitude, and the SNN density of SLP time series data, i.e. the SNN density of points on the globe)

116 Features and limitations of SNN Clustering Has similar strengths and limitations as Jarvis-Patrick clustering. Does not cluster all the points. Complexity of SNN clustering is high: O(n * time to find the neighbours within Eps), which in the worst case is O(n^2). For lower dimensions, there are more efficient ways to find the nearest neighbours, such as R* trees and k-d trees. The approach of using core points and SNN density adds power and flexibility

117 Scalable clustering algorithms Issues: storage requirements and computational requirements. Approaches:
Multi-dimensional or spatial access methods: e.g. k-d tree, R*-tree and X-tree form hierarchical partitions of the data space; grid-based clustering also partitions the data space.
Bounds on proximities: reduces the number of proximity calculations.
Sampling: cluster a sample of the points, e.g. CURE.
Partitioning of data objects: bisecting K-means, CURE.
Summarization: represent summarizations of the data, e.g. BIRCH

118 Scalable clustering algorithms Approaches continued:
Parallel and distributed computation: parallel and distributed versions of clustering algorithms; conventional parallel master-slave programming approaches; MapReduce / Hadoop based approaches.
Algorithms specifically designed for large-scale clustering: DBSCAN (including spatial indexing); BIRCH (balanced iterative reducing and clustering using hierarchies), which builds a compressed tree structure of clustering features (CF) in a CF-tree, including measures for inter/intra-cluster cohesion and separation; CURE (clustering using representatives, applying sampling and partitioning); CLARANS (clustering large applications based on K-medoids and randomized search).
Parallel and distributed versions of earlier clustering algorithms: PBIRCH (parallel version of BIRCH); ICLARANS (parallel version of CLARANS); PKMeans (parallel version of K-means) and K-Means++ (MapReduce versions of K-means); Parallel DBSCAN and MR-DBSCAN (parallel and MapReduce versions of DBSCAN)

119 Computational complexity of clustering algorithms (from the book Clustering, Xu and Wunsch, 2009) TABLE 8.1. Computational complexity of clustering algorithms. Algorithms like BIRCH and WaveCluster can scale linearly with the input size and handle very large data sets.

Cluster Algorithm          Complexity                                                Suitable for High-Dimensional Data
K-means                    O(NKd) (time), O(N+K) (space)                             No
Fuzzy c-means              Near O(N)                                                 No
Hierarchical clustering*   O(N^2) (time and space)                                   No
PAM                        O(K(N-K)^2)                                               No
CLARA+                     O(K(40+K)^2 + K(N-K)) (time)                              No
CLARANS                    Quadratic in total performance                            No
BIRCH                      O(N) (time)                                               No
DBSCAN                     O(N log N) (time)                                         No
CURE                       O(N_sample^2 log N_sample) (time), O(N_sample) (space)    Yes
WaveCluster                O(N) (time)                                               No
DENCLUE                    O(N log N) (time)                                         Yes
FC                         O(N) (time)                                               Yes
STING                      O(number of cells at the bottom layer)                    No
CLIQUE                     Linear in the number of objects, quadratic in dimensions  Yes
OptiGrid                   Between O(Nd) and O(Nd log N)                             Yes
ORCLUS                     O(K0^3 + K0 N d + K0 d^2) (time), O(K0 d^2) (space) #     Yes

* Includes single-linkage, complete-linkage, average-linkage, etc.
+ Based on the heuristic for drawing a sample from the entire data set (Kaufman and Rousseeuw, 1990)
# K0 is the number of initial seeds

120 Comparison of Clustering Techniques

121 BIRCH - balanced iterative reducing and clustering using hierarchies BIRCH is designed for clustering large amounts of numeric data by integrating hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage). It overcomes two difficulties of agglomerative clustering methods: scalability and the inability to undo what was done in the previous step. BIRCH uses the notion of a clustering feature to summarize a cluster, and a clustering feature tree (CF-tree) to represent a cluster hierarchy. It achieves good performance and scalability (O(n)) in large or even streaming databases, which also makes it effective for incremental and dynamic clustering of incoming objects. The ideas of BIRCH to use clustering features and CF-trees have been borrowed by many others to tackle problems of clustering streaming and dynamic data

122 BIRCH - balanced iterative reducing and clustering using hierarchies The clustering feature (CF) of a cluster is a 3-D vector summarizing information about the cluster of objects. It is defined as CF = (n, LS, SS), where n is the number of data points, LS is the linear sum of the points and SS is the square sum of the points. The clustering feature can be used to derive many useful statistics of a cluster; for example, the cluster's centroid c is given by c = LS / n. Clustering features are also additive: for two disjoint clusters C1 and C2 with clustering features CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2), the clustering feature for the cluster formed by merging C1 and C2 is simply CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2)
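A small sketch of a BIRCH-style clustering feature, illustrating the centroid derivation and the additivity property (the class and method names are my own, not BIRCH's actual implementation):

```python
import numpy as np

class ClusteringFeature:
    """BIRCH-style clustering feature: (n, linear sum, square sum)."""
    def __init__(self, points):
        points = np.atleast_2d(points)
        self.n = len(points)
        self.ls = points.sum(axis=0)          # linear sum of the points
        self.ss = (points ** 2).sum()         # square sum of the points (scalar)

    def centroid(self):
        return self.ls / self.n               # c = LS / n

    def merge(self, other):
        """CFs are additive: merging two clusters just adds the CF components."""
        merged = ClusteringFeature(np.zeros((1, len(self.ls))))
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged
```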

123 BIRCH - balanced iterative reducing and clustering using hierarchies A CF-tree is a height-balanced tree storing clustering features for a hierarchical clustering. A nonleaf node in the tree has a number of children, and nonleaf nodes also store the sums of the CFs of their children, summarizing the clustering information about their children. A CF-tree has two parameters: the branching factor, B, and the threshold, T. The branching factor specifies the maximum number of children per nonleaf node. The threshold parameter specifies the maximum diameter of the subclusters stored at the leaf nodes of the tree. These two parameters implicitly control the resulting tree's size. (figure: a CF-tree with CF_1, CF_2, ..., CF_k at the root level and CF_11, CF_12, ..., CF_1k at the first level)

124 Outline of BIRCH BIRCH applies a multiphase clustering technique: A single scan of the data set yields a basic, good clustering, and one or more additional scans can optionally be used to further improve the quality

125 CURE - another hierarchical approach CURE (Clustering Using Representatives) represents each cluster with multiple representative points from the cluster: the first point is the one farthest from the center, and each subsequent point is chosen as far away as possible from the previously selected set. Uses a variety of techniques to handle large data sets, outliers and clusters with non-spherical shapes and non-uniform sizes. Applies agglomerative hierarchical clustering for the actual clustering into a desired number of clusters specified by the user. Sampling and partitioning are applied to improve computational performance: CURE partitions a random sample of the data points and performs hierarchical clustering on each partition. The clusters found in this initial clustering are then clustered further into the final number of desired clusters. A final pass assigns each remaining point in the data set to the closest cluster, i.e. the cluster with the closest representative point

126 CURE: another hierarchical approach Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster by a shrinkage factor (typically between 0.2 and 0.7). Cluster similarity is the similarity of the closest pair of representative points from different clusters Shrinking representative points toward the center helps avoid problems with noise and outliers CURE is able to handle clusters of arbitrary shapes and sizes through the representation of the clusters by multiple representative points
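A sketch of how CURE-style representative points can be selected and shrunk toward the centroid, following the description above (the function name and defaults are my own):

```python
import numpy as np

def cure_representatives(points, n_rep=5, shrink=0.3):
    """Pick well-scattered representative points and shrink them toward the centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # first representative: the point farthest from the cluster center
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(points)):
        # next representative: the point farthest from the current set of representatives
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)   # shrink toward the centroid (factor ~0.2-0.7)
```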

127 Outline of the CURE algorithm (the algorithm outline is given in the figure), where m is the number of data points, p the number of partitions, q the desired reduction of points in a partition (i.e. the number of clusters in a partition is m/(pq)), and K is the desired number of clusters

128 Experimental Results: CURE Picture from CURE, Guha, Rastogi, Shim

129 Experimental Results: CURE (centroid) (single link) Picture from CURE, Guha, Rastogi, Shim


Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1 Hierarchy An arrangement or classification of things according to inclusiveness A natural way of abstraction, summarization, compression, and simplification for understanding Typical setting: organize

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis 7 Supervised learning vs unsupervised learning Unsupervised Learning Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute These patterns are then

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining 数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Cluster Analysis Reading: Chapter 10.4, 10.6, 11.1.3 Han, Chapter 8.4,8.5,9.2.2, 9.3 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber &

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Slides From Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Slides From Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling To Appear in the IEEE Computer CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling George Karypis Eui-Hong (Sam) Han Vipin Kumar Department of Computer Science and Engineering University

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Clustering Algorithms for Spatial Databases: A Survey

Clustering Algorithms for Spatial Databases: A Survey Clustering Algorithms for Spatial Databases: A Survey Erica Kolatch Department of Computer Science University of Maryland, College Park CMSC 725 3/25/01 kolatch@cs.umd.edu 1. Introduction Spatial Database

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Clustering. Chapter 10 in Introduction to statistical learning

Clustering. Chapter 10 in Introduction to statistical learning Clustering Chapter 10 in Introduction to statistical learning 16 14 12 10 8 6 4 2 0 2 4 6 8 10 12 14 1 Clustering ² Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990). ² What

More information

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Hierarchical Clustering Produces a set

More information

Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal

More information

Clustering Lecture 3: Hierarchical Methods

Clustering Lecture 3: Hierarchical Methods Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analysis: Basic Concepts and Algorithms Data Warehousing and Mining Lecture 10 by Hossen Asiful Mustafa What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

TRANSACTIONAL CLUSTERING. Anna Monreale University of Pisa

TRANSACTIONAL CLUSTERING. Anna Monreale University of Pisa TRANSACTIONAL CLUSTERING Anna Monreale University of Pisa Clustering Clustering : Grouping of objects into different sets, or more precisely, the partitioning of a data set into subsets (clusters), so

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

CSE 347/447: DATA MINING

CSE 347/447: DATA MINING CSE 347/447: DATA MINING Lecture 6: Clustering II W. Teal Lehigh University CSE 347/447, Fall 2016 Hierarchical Clustering Definition Produces a set of nested clusters organized as a hierarchical tree

More information

Lecture 10: Semantic Segmentation and Clustering

Lecture 10: Semantic Segmentation and Clustering Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305

More information

A Comparative Study of Various Clustering Algorithms in Data Mining

A Comparative Study of Various Clustering Algorithms in Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

More information

Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1

Contents. List of Figures. List of Tables. List of Algorithms. I Clustering, Data, and Similarity Measures 1 Contents List of Figures List of Tables List of Algorithms Preface xiii xv xvii xix I Clustering, Data, and Similarity Measures 1 1 Data Clustering 3 1.1 Definition of Data Clustering... 3 1.2 The Vocabulary

More information