Clustering. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 238

1 Clustering

2 What is Clustering?

3 What is Clustering?
Clustering (class discovery)
- Given a set of objects, group them into clusters (classes that are unknown beforehand).
- Clustering is an instance of unsupervised learning (no training dataset).
In practice
- Cluster images to find categories.
- Cluster patient data to find disease subtypes.
- Cluster persons in social networks to detect communities.

4 What is Clustering?
Supervised versus unsupervised learning
- General inference problem: given $x_i$, predict $y_i$ by learning a function $f$.
- Training set: set of examples $(x_i, y_i)$ where $y_i = f(x_i)$ (but $f$ is still unknown!).
- Test set: new set of data points $x_i$ where $y_i$ is unknown.
- Supervised: use training data to infer your model, then apply this model to the test data.
- Unsupervised: no training data; learn the model and apply it directly to the test data.

5 k-means Clustering

6 k-means
Objective
Partition the dataset into $k$ clusters such that the intra-cluster variance is minimised:
$$V(D) = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2 \qquad (1)$$
where $V$ is the variance, $S_i$ is a cluster, $\mu_i$ is its mean, and $D$ is the dataset of all points $x_j$.
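
To make Equation (1) concrete, the following minimal sketch evaluates $V(D)$ for two candidate partitions of the same four points. The function name `intra_cluster_variance` and the toy data are illustrative assumptions, not part of the course material.

```python
def intra_cluster_variance(clusters):
    """Sum of squared distances of every point to the mean of its cluster, V(D)."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Mean of the cluster, computed coordinate by coordinate.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in points for d in range(dim))
    return total


# Hypothetical toy data: the same four points, partitioned in two ways.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
v_good = intra_cluster_variance(good)
v_bad = intra_cluster_variance(bad)
```

The partition that groups nearby points yields the smaller value of $V(D)$, which is the partition k-means prefers.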

7 k-means
Lloyd's algorithm (Lloyd, 1957)
1. Partition the data into $k$ initial clusters.
2. Compute the mean of each cluster.
3. Assign each point to the cluster whose mean is closest to the point.
4. If any point changed its cluster membership, repeat from Step 2.
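
The four steps above can be sketched in plain Python as follows. This is one possible reading under stated assumptions: the round-robin initial partition, the `max_iter` safeguard, and the rule that an empty cluster keeps its previous mean are choices made for this illustration and are not prescribed by the slides.

```python
def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, k, max_iter=100):
    # Step 1: round-robin initial partition (an assumption; any partition works).
    labels = [i % k for i in range(len(points))]
    means = [None] * k
    for _ in range(max_iter):
        # Step 2: mean of each cluster; an empty cluster keeps its previous mean.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                means[c] = mean(members)
        # Step 3: assign each point to the cluster with the closest mean.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, means[c])) for p in points
        ]
        # Step 4: stop as soon as no point changed its membership.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, means


# Hypothetical toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = lloyd(data, k=2)
```

Because the result depends on the initial partition, a different Step 1 may lead to a different local optimum on less clearly separated data.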

8 k-means Example: Iteration 1

9 k-means Example: Iteration 2

10 k-means Example: Iteration 3

11 k-means Example: Iteration 4

12 k-means: Number of Clusters. Example: k = 2

13 k-means: Number of Clusters. Example: k = 3

14 k-means: Number of Clusters. Example: k = 4

15 k-means: Number of Clusters. Example: k = 5

16 k-means: Number of Clusters. Example: k = 6

17 k-means: Effect of Initialization. Example: Initialization 1, Iteration 1

18 k-means: Effect of Initialization. Example: Initialization 1, Iteration 5

19 k-means: Effect of Initialization. Example: Initialization 1, Iteration 9

20 k-means: Effect of Initialization. Example: Initialization 2, Iteration 1

21 k-means: Effect of Initialization. Example: Initialization 2, Iteration 2

22 k-means: Effect of Initialization. Example: Initialization 2, Iteration 3

23 k-means: Effect of Initialization. Example: Initialization 2, Iteration 4

24 k-means
Things to note
- k-means is still the state-of-the-art method for most clustering tasks.
- When proposing a new clustering method, one should always compare to k-means.
Lloyd's algorithm has several drawbacks:
- It is order-dependent.
- Its result depends on the initialisation of the clusters.
- Its result may be a local optimum, not the globally optimal solution.

25 k-centroid
Brother of k-means
- Don't use the mean of each cluster but the medoid.
- The medoid is the point closest to the mean:
$$m_i = \arg\min_{x_j \in S_i} \| x_j - \mu_i \|^2$$
- One thereby restricts the cluster means to points that are present in the dataset.
- One only minimises variance with respect to these points.
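
As a small illustration of the medoid definition above, the sketch below picks the cluster member closest to the cluster mean. The function name `medoid` and the example points are assumptions made for this illustration.

```python
def medoid(points):
    """Return the member of `points` that is closest to their mean."""
    dim = len(points[0])
    mu = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return min(points, key=lambda p: sum((p[d] - mu[d]) ** 2 for d in range(dim)))


# Hypothetical cluster with one outlier; its mean (1.75, 1.5) is not a data point.
cluster = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
m = medoid(cluster)
```

Unlike the mean, the returned representative is always an actual object from the dataset.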

26 Kernel k-means
Kernelised k-means?
It would be attractive to perform clustering using kernels:
- one can move the clustering problem to different feature spaces,
- one can cluster string and graph data.
But we have to be able to perform all steps in k-means using kernels!

27 Kernel k-means
Kernelised k-means
The key step in k-means is to compute the distance between one data point $x_1$ and the mean of a cluster of points $x_2, \ldots, x_m$:
$$\left\| \phi(x_1) - \frac{1}{m-1} \sum_{j=2}^{m} \phi(x_j) \right\|^2 = k(x_1, x_1) - \frac{2}{m-1} \sum_{j=2}^{m} k(x_1, x_j) + \frac{1}{(m-1)^2} \sum_{i=2}^{m} \sum_{j=2}^{m} k(x_i, x_j) \qquad (2)$$
This result is based on the fact that every kernel $k$ induces a distance $d$:
$$d(x_i, x_j)^2 = \| \phi(x_i) - \phi(x_j) \|^2 = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j)$$
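
Equation (2) can be checked numerically. The sketch below evaluates its right-hand side using kernel evaluations only; with a linear kernel, the result must equal the ordinary squared Euclidean distance to the cluster mean. The names `kernel_distance_to_mean` and `linear_kernel` are hypothetical helpers introduced for this illustration.

```python
def kernel_distance_to_mean(k, x1, cluster):
    """Squared feature-space distance between phi(x1) and the mean of phi over `cluster`."""
    n = len(cluster)  # plays the role of m - 1 in Equation (2)
    self_term = k(x1, x1)
    cross_term = 2.0 / n * sum(k(x1, xj) for xj in cluster)
    within_term = sum(k(xi, xj) for xi in cluster for xj in cluster) / n ** 2
    return self_term - cross_term + within_term


def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))


# With the linear kernel, phi is the identity: the cluster mean is (1, 0),
# so the squared distance from (3, 4) is (3 - 1)^2 + 4^2 = 20.
x1 = (3.0, 4.0)
cluster = [(0.0, 0.0), (2.0, 0.0)]
d2 = kernel_distance_to_mean(linear_kernel, x1, cluster)
```

Any other positive semi-definite kernel can be passed as `k` without ever computing $\phi$ explicitly.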

28 k-means: Silhouette Coefficients
How to set k in k-means
A silhouette coefficient $s(x)$ (Rousseeuw, 1987) relates the average distance between a point $x$ and all other points from its cluster $C$, $d(x, \mu_C)$, to the average distance between $x$ and the points from the second-nearest cluster $C'$, $d(x, \mu_{C'})$:
$$s(x) = \frac{d(x, \mu_{C'}) - d(x, \mu_C)}{\max(d(x, \mu_C), d(x, \mu_{C'}))}$$
Interpretation of $s(x)$:
- $s(x)$ is close to 1 if a point is clearly located in its cluster $C$.
- $s(x)$ is close to 0 if a point is located between two clusters.
- $s(x)$ is negative if the point is closer to cluster $C'$ than to its current cluster.
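
The silhouette coefficient of a single point can be sketched as follows, assuming Euclidean distance and a cluster with at least two members. The function name `silhouette` and the toy clusters are illustrative assumptions.

```python
from math import dist


def silhouette(x, own_cluster, other_clusters):
    # Average distance from x to the other points of its own cluster C.
    own = [p for p in own_cluster if p != x]
    d_own = sum(dist(x, p) for p in own) / len(own)
    # Average distance to the second-nearest cluster C'
    # (the other cluster with the smallest average distance to x).
    d_next = min(sum(dist(x, p) for p in c) / len(c) for c in other_clusters)
    return (d_next - d_own) / max(d_own, d_next)


# Hypothetical example: a well-placed point and a misassigned point.
right = [(10.0, 0.0), (10.0, 1.0)]
s_good = silhouette((0.0, 0.0), [(0.0, 0.0), (0.0, 1.0)], [right])
s_bad = silhouette((9.0, 0.0), [(9.0, 0.0), (0.0, 0.0), (0.0, 1.0)], [right])
```

Averaging $s(x)$ over all points for several values of $k$ gives one way to compare candidate numbers of clusters.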

29 k-means: Silhouette Coefficients

30 k-means: Silhouette Coefficients

31 k-means: Silhouette Coefficients

32 Graph-based Clustering

33 Graph-based Clustering

34 Graph-based Clustering
Data representation
- The dataset $D$ is given in terms of a graph $G = (V, E)$.
- A data object $v_i$ is a node in $G$.
- The edge $e_{ij}$ from node $v_i$ to node $v_j$ has weight $w_{ij}$.
Graph-based clustering
- Define a threshold $\theta$.
- Remove all edges $e_{ij}$ from $G$ with weight $w_{ij} > \theta$.
- Each connected component of the graph now corresponds to one cluster.
- Two nodes are in the same connected component if there is a path between them.
- Graph components can be found by depth-first search in a graph ($O(|V| + |E|)$).
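
The thresholding procedure above can be sketched with an iterative depth-first search. The function name `threshold_clusters`, the edge-list input format, and the toy graph are assumptions made for this illustration; edge weights are treated as distances, so edges with weight above $\theta$ are dropped.

```python
def threshold_clusters(nodes, weighted_edges, theta):
    # Keep only edges whose weight does not exceed theta.
    adjacency = {v: [] for v in nodes}
    for u, v, w in weighted_edges:
        if w <= theta:
            adjacency[u].append(v)
            adjacency[v].append(u)
    # Iterative depth-first search; each traversal yields one connected component.
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Hypothetical path graph a-b-c-d-e; the heavy edge c-d is removed at theta = 0.5.
edges = [("a", "b", 0.2), ("b", "c", 0.3), ("c", "d", 0.9), ("d", "e", 0.1)]
clusters = threshold_clusters("abcde", edges, theta=0.5)
```

Every node and every surviving edge is visited a constant number of times, which matches the $O(|V| + |E|)$ bound stated above.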

35 Graph-based Clustering: Original graph

36 Graph-based Clustering: Thresholded graph (θ = 0.5)

37 Graph-based Clustering
But how to get the graph in the first place?
- Some domains have a natural graph structure, e.g. telecommunication or social networks.
- Otherwise, obtain the graph structure through a distance function $d$ on the vertices by:
  - connecting each node to its k-nearest neighbors, or
  - connecting each node to all nodes in an $\epsilon$-neighborhood.
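
Both graph constructions can be sketched in a few lines, assuming Euclidean distance and undirected edges stored as index pairs. The function names `knn_graph` and `epsilon_graph` are hypothetical and chosen for this illustration only.

```python
from math import dist


def knn_graph(points, k):
    """Undirected edges connecting each point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def epsilon_graph(points, eps):
    """Undirected edges between all pairs of points at distance at most eps."""
    return {
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if dist(points[i], points[j]) <= eps
    }


# Hypothetical points on a line.
pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
knn_edges = knn_graph(pts, k=1)
eps_edges = epsilon_graph(pts, eps=1.5)
```

Note the difference on this example: the k-nearest-neighbor graph still attaches the isolated point to its closest neighbor, whereas the $\epsilon$-neighborhood graph leaves it unconnected.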

38 DBScan
Noise-robust graph-based clustering
- Graph-based clustering can suffer from the fact that one noisy edge connects two clusters.
- DBScan (Ester et al., 1996) is a noise-robust extension of graph-based clustering.
- DBScan is short for Density-Based Spatial Clustering of Applications with Noise.
Core object
- Two objects $v_i$ and $v_j$ with distance $d(v_i, v_j) < \epsilon$ belong to the same cluster if either $v_i$ or $v_j$ is a core object.
- $v_i$ is a core object iff there are at least MinPts points within a distance of $\epsilon$ from $v_i$.
- A cluster is defined by iteratively checking this core object property.

39 DBScan: DBScan-clustered graph (MinPts = 2, Eps = 0.5)

40 DBScan: DBScan-clustered graph (MinPts = 3, Eps = 0.5)

41 DBScan: DBScan-clustered graph (MinPts = 3, Eps = 0.5)

42 DBScan Pseudocode I
Code: Main
DBSCAN(SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  for i FROM 1 TO SetOfPoints.size do
    Point := SetOfPoints.get(i);
    if Point.ClId = UNCLASSIFIED then
      if ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) then
        ClusterId := nextId(ClusterId)
      end if
    end if
  end for
Code: ExpandCluster
ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts): Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  if seeds.size < MinPts then
    SetOfPoints.changeClId(Point, NOISE);

43 DBScan Pseudocode II
    RETURN False;
  else
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    while seeds <> Empty
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      if result.size >= MinPts then
        for i FROM 1 TO result.size do
          resultP := result.get(i);
          if resultP.ClId IN (UNCLASSIFIED, NOISE) then
            if resultP.ClId = UNCLASSIFIED then
              seeds.append(resultP);
            end if

44 DBScan Pseudocode III
            SetOfPoints.changeClId(resultP, ClId);
          end if // UNCLASSIFIED or NOISE
        end for;
      end if; // result.size >= MinPts
      seeds.delete(currentP);
    end while; // seeds <> Empty
    RETURN True;
  end if
end // ExpandCluster
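
The pseudocode on the three slides above can be translated into runnable Python roughly as follows. This is a sketch under stated assumptions: points are tuples compared with Euclidean distance, a neighborhood includes every point at distance at most `eps` (including the query point itself), and the label constants and function names are chosen for this illustration.

```python
from math import dist

UNCLASSIFIED, NOISE = -2, -1  # hypothetical label constants


def region_query(points, i, eps):
    # Indices of all points within distance eps of point i (including i itself).
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:  # point i is not a core object
        labels[i] = NOISE
        return False
    for j in seeds:
        labels[j] = cluster_id
    seeds.remove(i)
    while seeds:
        current = seeds[0]
        result = region_query(points, current, eps)
        if len(result) >= min_pts:  # current is a core object
            for j in result:
                if labels[j] in (UNCLASSIFIED, NOISE):
                    if labels[j] == UNCLASSIFIED:
                        seeds.append(j)
                    labels[j] = cluster_id
        seeds.remove(current)
    return True


def dbscan(points, eps, min_pts):
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] == UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels


# Hypothetical data: two dense groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(data, eps=1.5, min_pts=3)
```

Points that are never reached from a core object keep the `NOISE` label, which is how the isolated point in the example is reported.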

45 DBScan
Properties
- Cluster assignment of border points is order-dependent.
- Unlike k-means, one does not have to specify the number of clusters a priori.
- But one has to set MinPts and Eps.
- Ester et al. report that for 2D examples MinPts = 4 is sufficient for good results.
- They determine Eps by visual inspection of a k-distance plot.
- Transfer question: How to kernelise DBScan?

46 Spectral Clustering
Based on: Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press (2014), Chapter 16.

47 Spectral Clustering
Concept
- Spectral Clustering connects graph-based clustering with k-means.
- But what is the link between these two approaches?
- To understand this link, we must first familiarize ourselves with cut-based clustering.

Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2015 210 / 238

48 Spectral Clustering: Cut-based clustering (figure: example graph with edge weights 0.7, 0.8, 0.8, 0.2, 0.1, 0.2, 0.9, 0.8, 0.8, 0.8, 0.9, 0.7)

49 Spectral Clustering
Cut-based clustering
- Objects are nodes in a graph $G$ with nodes $V$ and edges $E$.
- The adjacency matrix $W$ represents the similarities between pairs of nodes.
- Assume $V$ is partitioned into $k$ subsets: $V = \{C_1, \ldots, C_k\}$.
- Cut-based clustering tries to minimize the total weight of inter-cluster edges:
$$\min \frac{1}{2} \sum_{a=1}^{k} \sum_{b=1}^{k} \kappa(C_a, C_b),$$
where $\kappa(C_a, C_b) = \sum_{v_i \in C_a, v_j \in C_b} W_{ij}$ for $a \neq b$, and $\kappa(C_a, C_a) = 0$.

50 Spectral Clustering
Link to the Graph Laplacian
The degree matrix $D$ is defined as
$$D_{ij} = \begin{cases} \sum_{l=1}^{n} W_{il} & i = j \\ 0 & i \neq j \end{cases}$$
The (unnormalized) Graph Laplacian is defined as $L = D - W$.

51 Spectral Clustering
Link to the Graph Laplacian
Let $c_a$ be a vector of size $n$ such that
$$c_a(i) = \begin{cases} 1 & \text{if } v_i \in C_a \\ 0 & \text{if } v_i \notin C_a \end{cases}$$
Note that $\langle c_a, c_a \rangle = \|c_a\|^2 = |C_a|$, the size of cluster $C_a$. Furthermore, note that all $c_a$ and $c_b$ with $a \neq b$ are orthogonal, that is, $\langle c_a, c_b \rangle = 0$.

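52 Spectral Clustering
Link to the Graph Laplacian
$$c_a^\top L c_a = c_a^\top (D - W) c_a \qquad (3)$$
$$= \sum_{i,j} c_a(i) c_a(j) (D_{ij} - W_{ij}) \qquad (4)$$
$$= \sum_{i} c_a(i) \Big( \sum_{l=1}^{n} W_{il} \Big) - \sum_{i,j} c_a(i) c_a(j) W_{ij} \qquad (5)$$
$$= \sum_{b=1}^{k} \sum_{v_i \in C_a, v_j \in C_b} W_{ij} - \sum_{v_i \in C_a, v_j \in C_a} W_{ij} = \sum_{b \neq a} \sum_{v_i \in C_a, v_j \in C_b} W_{ij} \qquad (6)$$
$$= \sum_{b} \kappa(C_a, C_b) \qquad (7)$$
Step (5) uses that $D$ is diagonal and that $c_a(i)^2 = c_a(i)$ for a binary indicator vector.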
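
The identity derived in Equations (3) to (7) can be verified numerically on a small graph. The sketch below builds $L = D - W$ for a hypothetical four-node similarity matrix and compares $c_a^\top L c_a$ with the total weight of the edges leaving $C_a$. The matrix values and the helper names `quadratic_form` and `cut_weight` are assumptions made for this illustration.

```python
# Hypothetical symmetric similarity matrix for four nodes.
W = [
    [0.0, 0.8, 0.7, 0.0],
    [0.8, 0.0, 0.9, 0.2],
    [0.7, 0.9, 0.0, 0.1],
    [0.0, 0.2, 0.1, 0.0],
]
n = len(W)

# Unnormalised Laplacian L = D - W, with D the diagonal matrix of row sums.
L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]


def quadratic_form(c, M):
    return sum(c[i] * M[i][j] * c[j] for i in range(len(c)) for j in range(len(c)))


def cut_weight(cluster, W):
    # Total weight of the edges that leave the cluster.
    return sum(W[i][j] for i in cluster for j in range(len(W)) if j not in cluster)


cluster_a = {0, 1, 2}
c_a = [1.0 if i in cluster_a else 0.0 for i in range(n)]  # indicator vector
lhs = quadratic_form(c_a, L)
rhs = cut_weight(cluster_a, W)
```

Both quantities evaluate to 0.3 here, the combined weight of the two edges that connect node 3 to the rest of the graph.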

53 Spectral Clustering
Link to the Graph Laplacian
It follows that finding the minimum k-cut is identical to minimizing
$$\min \frac{1}{2} \sum_{a=1}^{k} c_a^\top L c_a$$

54 Spectral Clustering
Ratio Cut
Minimum k-cut clustering is prone to finding very small clusters. Ratio Cut accounts for this problem by dividing the cut size by the size of the cluster:
$$\min_{C} \sum_{a=1}^{k} \frac{1}{|C_a|} \sum_{b=1}^{k} \kappa(C_a, C_b)$$
This can be rewritten as
$$\min_{C} \sum_{a=1}^{k} \frac{c_a^\top L c_a}{\|c_a\|^2}.$$

55 Spectral Clustering
Ratio Cut
Finding the optimal $C$, that is, the optimal binary cluster indicator vectors $c_a$ for $a \in \{1, \ldots, k\}$, is NP-hard. We therefore relax the problem and allow the vectors $c_a$ to take any real value, rather than being binary. The objective can then be rewritten as
$$\sum_{a=1}^{k} \Big( \frac{c_a}{\|c_a\|} \Big)^\top L \Big( \frac{c_a}{\|c_a\|} \Big) = \sum_{a=1}^{k} u_a^\top L u_a, \qquad (8)$$
where $u_a = \frac{c_a}{\|c_a\|}$ is the unit vector in the direction of vector $c_a \in \mathbb{R}^n$.

56 Spectral Clustering
Ratio Cut
To find the optimal $u_a$, we first incorporate the constraints $u_a^\top u_a = 1$ via Lagrange multipliers $\lambda_a$ into Objective (8):
$$\sum_{a=1}^{k} u_a^\top L u_a + \sum_{a=1}^{k} \lambda_a (1 - u_a^\top u_a)$$
Setting the derivative with respect to $u_i$ to zero implies:
$$\frac{\partial}{\partial u_i} \Big( \sum_{a=1}^{k} u_a^\top L u_a + \sum_{a=1}^{k} \lambda_a (1 - u_a^\top u_a) \Big) = 0 \qquad (9)$$
$$2 L u_i - 2 \lambda_i u_i = 0, \qquad (10)$$
$$L u_i = \lambda_i u_i \qquad (11)$$

57 Spectral Clustering
Ratio Cut
- This implies that all $u_a$ are eigenvectors of $L$.
- Using Equation (11), it follows that $u_a^\top L u_a = u_a^\top \lambda_a u_a = \lambda_a$.
- This in turn implies that in order to minimize Objective (8), one should choose the $k$ smallest eigenvalues of $L$ and their corresponding eigenvectors.
- The eigenvectors represent the relaxed cluster indicator vectors (excluding $u_n$).

58 Spectral Clustering
Concept (Donath and Hoffman, 1973; Shi and Malik, 2000; Ng, Jordan, and Weiss, 2002)
Spectral Clustering solves this problem pragmatically by using the vectors $u_a$ as a new representation of the data points and applying k-means to this new representation after normalization.
The new representation is $U = (u_n, u_{n-1}, \ldots, u_{n-k+1})$. It is normalized row-by-row to obtain the new $k$-dimensional representation
$$Y = \begin{pmatrix} y_1^\top \\ y_2^\top \\ \vdots \\ y_n^\top \end{pmatrix}, \quad \text{via} \quad y_i = \frac{1}{\sqrt{\sum_{j=1}^{k} u_{n-j+1,i}^2}} \, (u_{n,i}, u_{n-1,i}, \ldots, u_{n-k+1,i})^\top. \qquad (12)$$

59 Spectral Clustering
Pseudocode
procedure Spectral Clustering(D, k)
  Compute the similarity matrix $W \in \mathbb{R}^{n \times n}$ and the Laplacian $L = D - W$;
  Solve $L u_a = \lambda_a u_a$ for $a = n, \ldots, n-k+1$, where $\lambda_n \leq \lambda_{n-1} \leq \ldots \leq \lambda_{n-k+1}$;
  $U := (u_n, u_{n-1}, \ldots, u_{n-k+1})$;
  $Y :=$ normalized rows of $U$ via Equation (12);
  $C := \{C_1, \ldots, C_k\}$ via k-means on $Y$;
  return: $C$
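
The procedure above can be sketched with NumPy as follows. This is an illustrative reading, not the course's reference implementation: the similarity matrix is passed in directly, `numpy.linalg.eigh` returns eigenvalues in ascending order so its first $k$ columns correspond to $u_n, \ldots, u_{n-k+1}$ in the slides' notation, and the small k-means helper uses a deterministic farthest-point initialisation chosen for this example only.

```python
import numpy as np


def spectral_embedding(W, k):
    """Rows of the result are the normalised k-dimensional points y_i of Equation (12)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    # eigh sorts eigenvalues in ascending order: the first k columns
    # belong to the k smallest eigenvalues of L.
    _, eigenvectors = np.linalg.eigh(L)
    U = eigenvectors[:, :k]
    return U / np.linalg.norm(U, axis=1, keepdims=True)


def kmeans(Y, k, iters=50):
    # Deterministic farthest-point initialisation (an assumption, not from the slides).
    centres = [Y[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(Y[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):
        sq = ((Y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        new_centres = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
            for c in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels


# Hypothetical graph: two triangles joined by one weak edge between nodes 2 and 3.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.01)]:
    W[i, j] = W[j, i] = w

Y = spectral_embedding(W, k=2)
labels = kmeans(Y, k=2)
```

On this graph the second eigenvector takes opposite signs on the two triangles, so the rows of `Y` fall into two tight groups that k-means separates easily.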

60 Spectral Clustering
Computational Complexity
- The overall worst-case runtime is $O(n^3)$ due to the need to compute eigenvectors and eigenvalues.
- For sparse graphs with $m$ edges, this runtime can be improved to $O(mn)$.
- Running k-means requires a runtime in $O(tnk^2)$, where $t$ is the number of iterations of k-means until convergence.

61 Soft-assignment Clustering

62 EM Clustering
Soft k-means
- k-means is based on a hard assignment of points to clusters.
- Wouldn't it be more realistic to work with the probability that each point belongs to each cluster, rather than a hard assignment?
- This is the core idea of Expectation Maximization (EM) Clustering with a Mixture of Gaussian distributions.
- It is also referred to as soft k-means.

63 EM Clustering
The General EM algorithm (Dempster et al., 1977)
- We are dealing with observed variables (objects $X$ and their features), latent variables (the cluster membership $Y$ of the objects), and model parameters $\theta$ (parameters of the underlying probability distribution).
- We would like to maximize $p(X \mid \theta)$.
- This optimization is difficult as $\log p(X \mid \theta) = \log \{ \sum_{Y} p(X, Y \mid \theta) \}$, that is, we have to sum over the latent variables inside the logarithm, which makes the evaluation of the maximum likelihood extremely challenging.
- The EM algorithm circumvents this problem in an iterative 2-step procedure.

64 EM Clustering
The General EM algorithm (Dempster et al., 1977; Source: Bishop, 2006, Chapter 9.2)
Given a joint distribution $p(X, Y \mid \theta)$ over observed variables $X$ and latent variables $Y$, governed by parameters $\theta$, the goal is to maximize the likelihood function $p(X \mid \theta)$ with respect to $\theta$:
1. Choose an initial setting for the parameters $\theta^{old}$.
2. Expectation step (E step): Evaluate $p(Y \mid X, \theta^{old})$.
3. Maximization step (M step): Evaluate $\theta^{new}$ given by $\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$, where $Q(\theta, \theta^{old}) = \sum_{Y} p(Y \mid X, \theta^{old}) \log p(X, Y \mid \theta)$.
4. Check for convergence of the parameters or the log likelihood. If not converged, then set $\theta^{old} \leftarrow \theta^{new}$ and return to Step 2.
EM may converge to a local optimum.

65 EM Clustering
Concept
Consider the case of a mixture of $k$ Gaussians in which $\theta$ is a triplet $(c, \{\mu_1, \ldots, \mu_k\}, \{\Sigma_1, \ldots, \Sigma_k\})$, where $P_{\theta}[Y = y] = c_y$ and
$$P_{\theta}[X = x \mid Y = y] = \mathcal{N}(x \mid \mu_y, \Sigma_y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) \Big)$$
For simplicity, we assume that $\Sigma_1 = \Sigma_2 = \ldots = \Sigma_k = I$, where $I$ is the identity matrix.

66 EM Clustering
Expectation step
For each $i \in \{1, \ldots, n\}$ and $y \in \{1, \ldots, k\}$, we have that
$$P_{\theta^{(t)}}[Y = y \mid X = x_i] = \frac{1}{Z_i} P_{\theta^{(t)}}[Y = y] \, P_{\theta^{(t)}}[X = x_i \mid Y = y] = \frac{1}{Z_i} c_y^{(t)} \exp\Big( -\frac{1}{2} \| x_i - \mu_y^{(t)} \|^2 \Big),$$
where $Z_i$ is a normalization factor which ensures that the probabilities $P_{\theta^{(t)}}[Y = y \mid X = x_i]$ sum to 1 over $y$.

67 EM Clustering
Maximization step
We need to maximize the following expression with respect to $\mu$ and $c$ to obtain $\theta^{(t+1)}$:
$$\sum_{i=1}^{n} \sum_{y=1}^{k} P_{\theta^{(t)}}[Y = y \mid X = x_i] \Big( \log(c_y) - \frac{1}{2} \| x_i - \mu_y \|^2 \Big).$$
Setting the derivative with respect to $\mu_y$ of the term above to zero and rearranging terms, one obtains:
$$\mu_y = \frac{\sum_{i=1}^{n} P_{\theta^{(t)}}[Y = y \mid X = x_i] \, x_i}{\sum_{i=1}^{n} P_{\theta^{(t)}}[Y = y \mid X = x_i]}$$
$\mu_y$ is the weighted average of the $x_i$, where the weights are the probabilities calculated in the E step.

68 EM Clustering
Maximization step
The maximal $c$ can be shown to be
$$c_y = \frac{\sum_{i=1}^{n} P_{\theta^{(t)}}[Y = y \mid X = x_i]}{\sum_{y'=1}^{k} \sum_{i=1}^{n} P_{\theta^{(t)}}[Y = y' \mid X = x_i]}$$
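
The E step and the two M-step updates can be combined into a short loop. The sketch below assumes identity covariances as on the slides, runs a fixed number of iterations instead of a convergence check, and uses hypothetical names and toy data throughout; it is not numerically hardened for points that lie very far from all means.

```python
from math import exp


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def em_identity_gaussians(points, means, weights, iterations=50):
    k, dim = len(means), len(points[0])
    for _ in range(iterations):
        # E step: responsibilities P[Y = y | X = x_i], normalised by Z_i.
        resp = []
        for x in points:
            unnorm = [weights[y] * exp(-0.5 * sq_dist(x, means[y])) for y in range(k)]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])
        # M step: responsibility-weighted means and mixing weights c_y.
        totals = [sum(r[y] for r in resp) for y in range(k)]
        means = [
            tuple(
                sum(r[y] * x[d] for r, x in zip(resp, points)) / totals[y]
                for d in range(dim)
            )
            for y in range(k)
        ]
        weights = [t / sum(totals) for t in totals]
    return means, weights, resp


# Hypothetical one-dimensional data with two groups around 0 and 4.
data = [(0.0,), (0.2,), (-0.2,), (4.0,), (4.2,), (3.8,)]
means, weights, resp = em_identity_gaussians(
    data, means=[(1.0,), (3.0,)], weights=[0.5, 0.5]
)
```

Each row of `resp` holds the soft assignment of one point; taking the index of the largest entry per row recovers a hard, k-means-like assignment.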

69 EM Clustering
Comparison to k-means
- k-means assigns each point to a cluster according to the distance to the cluster means. Then the cluster means are updated based on the examples in this cluster.
- EM determines the probability that each example belongs to each cluster. Then the cluster means are updated based on a weighted sum over all data points.

70 Hierarchical Clustering

71 Hierarchical Clustering
Extension of original setting
What if clusters contain clusters themselves? Then we need hierarchical clustering!

72 Hierarchical Clustering
Join the most similar clusters
- Iteratively join the two most similar clusters.
- But how to measure similarity between clusters?
Similarity of clusters (Florek et al., 1951), defined via a distance $d$, so that smaller values of $s$ mean more similar clusters:
- Single Link: $s(C_i, C_j) = \min_{x \in C_i, x' \in C_j} d(x, x')$
- Average Link: $s(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i, x' \in C_j} d(x, x')$
- Complete Link: $s(C_i, C_j) = \max_{x \in C_i, x' \in C_j} d(x, x')$

73 Hierarchical Clustering
Pseudocode
procedure Hierarchical Clustering(D, s)
  Initialize each point $x_i \in D$ as its own cluster $C_i$, for $i \in \{1, \ldots, n\}$.
  repeat
    $(i^*, j^*) = \arg\min_{(i,j),\, i \neq j} s(C_i, C_j)$
    Merge clusters $C_{i^*}$ and $C_{j^*}$.
  until $|C| = 1$
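
A naive version of this procedure, together with the three linkage functions, can be sketched as follows. Euclidean distance, the optional `num_clusters` early-stopping parameter, and all names are assumptions made for this illustration; the pairwise search is recomputed in every round and is therefore only suitable for small datasets.

```python
from math import dist


def single_link(a, b):
    return min(dist(x, y) for x in a for y in b)


def complete_link(a, b):
    return max(dist(x, y) for x in a for y in b)


def average_link(a, b):
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(points, linkage, num_clusters=1):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > num_clusters:
        # Find the pair of distinct clusters with the smallest linkage value.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


# Hypothetical data: two close pairs and one distant point.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerate(data, single_link, num_clusters=2)
```

The recorded `merges` list contains, in order, the pairs of clusters that were joined, which is the information a dendrogram displays.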

74 Hierarchical Clustering
Example: Dendrograms as a Way to Represent Hierarchical Clusterings

75 Hierarchical Clustering
Advantage
- Its clustering reflects the entire structure of the dataset.
Disadvantage
- It is difficult to make a clear statement about cluster membership in hierarchical clustering, as each point belongs to a hierarchy of clusters.
Stopping Hierarchical Clustering early is an approach to circumvent this cluster-assignment problem. Possible criteria are to stop the merging of clusters
- once a pre-defined number of clusters $k$ has been reached, or
- once the distance between the closest clusters exceeds a threshold $\epsilon$.

76 Clustering
Summary
- Clustering partitions a dataset into groups of similar objects.
- The three most popular families of clustering algorithms are
  1. k-means clustering,
  2. graph-based clustering,
  3. hierarchical clustering.
- When applying these algorithms, it is essential to be aware of their strengths and weaknesses and to report the exact parameter settings used (e.g. number of clusters, distance function used).
