Unsupervised Learning I: K-Means Clustering

Juniper McGee
5 years ago

1 Unsupervised Learning I: K-Means Clustering. Reading: Chapter 8 of Introduction to Data Mining by Tan, Steinbach, and Kumar.

2 Unsupervised learning = No labels on training examples! Main approach: Clustering

3 Example: Optdigits data set

4 Optdigits features: each instance is a vector of 64 features, $x = (f_1, f_2, \ldots, f_{64}) = (0, 2, 13, 16, 16, 16, 2, 0, 0, \ldots)$

5-6 Partitional Clustering of Optdigits. [Figure: instances plotted on axes Feature 1, Feature 2, Feature 3, standing in for the 64-dimensional space.]

7-9 Hierarchical Clustering of Optdigits. [Figure: instances plotted on axes Feature 1, Feature 2, Feature 3, standing in for the 64-dimensional space.]

10 Issues for clustering algorithms: How to measure distance between pairs of instances? How many clusters to create? Should clusters be hierarchical (i.e., clusters of clusters)? Should clustering be soft (i.e., an instance can belong to different clusters, with weighted belonging)?

11 Most commonly used (and simplest) clustering algorithm: K-Means Clustering

12 Adapted from Andrew Moore,


17 K-means clustering algorithm

18 K-means clustering algorithm Typically, use mean of points in cluster as centroid

19 K-means clustering algorithm. Distance metric: chosen by the user. For numerical attributes, $L_2$ (Euclidean) distance is often used: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. The centroid of a cluster here refers to the mean of the points in the cluster.
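A minimal sketch of the standard K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its cluster), using the Euclidean distance and mean centroid defined above. The function names, the seeding from data points, the `max_iters` cap, and the policy for empty clusters are assumptions made for illustration; they are not taken from the slides.

```python
import math
import random

def euclidean(x, y):
    """L2 distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mean_point(points):
    """Centroid of a non-empty list of points: the per-coordinate mean."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, clusters) once assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seeds are drawn from the data
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # Assumption: an empty cluster simply keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the loop recovers the two groups; on real data the result depends on the initial seeds.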

20-22 Example: Image segmentation by K-means clustering by color. [Figures: segmentations with K=5 and K=10, clustered in RGB space.]

23 Clustering text documents. A text document is represented as a feature vector of word frequencies. The similarity between two documents is measured by the cosine of the angle between their corresponding feature vectors (a larger cosine means the documents are closer).
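A small sketch of the cosine measure on word-frequency vectors. The whitespace tokenizer and the example sentences are assumptions for illustration only; real systems usually apply more careful tokenization and weighting.

```python
import math
from collections import Counter

def word_frequencies(text):
    """Word-frequency vector of a document, stored sparsely as a Counter."""
    return Counter(text.lower().split())

def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a if w in freq_b)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

# Hypothetical documents.
doc1 = word_frequencies("clustering groups similar documents")
doc2 = word_frequencies("clustering groups similar documents")
doc3 = word_frequencies("entropy measures class purity")
```

Identical documents give a cosine of 1, and documents with no words in common give 0.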

24 Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles. Boyack KW, Newman D, Duhon RJ, Klavans R, et al. (2011) Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches. PLoS ONE 6(3): e doi: /journal.pone

25 Exercise 1

26 How to evaluate clusters produced by K-means? Two approaches: unsupervised evaluation and supervised evaluation.

27 Unsupervised Cluster Evaluation. We don't know the classes of the data instances. Let $C$ denote a clustering (i.e., the set of $K$ clusters that results from a clustering algorithm) and let $c_i$ denote a cluster in $C$. Let $|c_i|$ denote the number of elements in $c_i$. Example clustering $C$ (figure): $|c_1| = 9$, $|c_2| = 6$, $|c_3| = 6$. How can we quantify how good a clustering this is?

28-29 We want to minimize the distance between the elements of each cluster $c$ and its centroid $\mu_c$ (the coherence of each cluster), i.e., minimize the mean square error (mse): $\mathrm{mse}(c) = \frac{\sum_{x \in c} d(x, \mu_c)^2}{|c|}$ and $\text{Average mse}(C) = \frac{\sum_{c \in C} \mathrm{mse}(c)}{K}$. Note: The assigned reading uses sum square error rather than mean square error.
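The two mse formulas can be sketched directly in code. The function names and the two-cluster toy data are assumptions for illustration.

```python
def squared_distance(x, y):
    """Squared Euclidean distance d(x, y)^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def centroid(cluster):
    """Mean of the points in a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def mse(cluster):
    """Mean squared distance from each point to the cluster centroid."""
    mu = centroid(cluster)
    return sum(squared_distance(x, mu) for x in cluster) / len(cluster)

def average_mse(clustering):
    """Average of mse(c) over the K clusters of the clustering."""
    return sum(mse(c) for c in clustering) / len(clustering)

# Hypothetical clustering with K = 2.
clustering = [
    [(0.0, 0.0), (2.0, 0.0)],  # centroid (1, 0); each point is 1 away
    [(5.0, 5.0), (5.0, 9.0)],  # centroid (5, 7); each point is 2 away
]
```

Here mse is 1 for the first cluster and 4 for the second, so the average mse is 2.5.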

30-32 We also want to maximize the pairwise separation of the clusters. That is, maximize the mean square separation (mss): $\mathrm{mss}(C) = \frac{\sum_{\text{all distinct pairs of clusters } i, j \in C \ (i \neq j)} d(\mu_i, \mu_j)^2}{K(K-1)/2}$.
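A short sketch of mss, assuming the cluster centroids have already been computed and that K is at least 2. The function names and the example centroids are hypothetical.

```python
from itertools import combinations

def squared_distance(x, y):
    """Squared Euclidean distance d(x, y)^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def mss(centroids):
    """Mean squared distance over all K(K-1)/2 unordered centroid pairs."""
    pairs = list(combinations(centroids, 2))
    return sum(squared_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical centroids for K = 3.
centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
```

The three pairs have squared distances 25, 100, and 25, so mss is 150 / 3 = 50.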

33 Exercises 2-3

34-35 Supervised Cluster Evaluation. Suppose we know the classes of the data instances: black, blue, green. Entropy of a cluster: the degree to which a cluster consists of objects of a single class.

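36 $\mathrm{entropy}(c_i) = -\sum_{j=1}^{|\mathrm{Classes}|} p_{i,j} \log_2 p_{i,j}$, where $p_{i,j}$ = probability that a member of cluster $i$ belongs to class $j$ = $m_{i,j} / m_i$, where $m_{i,j}$ is the number of instances in cluster $i$ with class $j$ and $m_i$ is the number of instances in cluster $i$. Example (clusters $c_1$, $c_2$, $c_3$ in the figure): $\mathrm{entropy}(c_1) = -\left(\frac{3}{9}\log_2\frac{3}{9} + \ldots\right)$, $\mathrm{entropy}(c_2) = -\left(\frac{1}{6}\log_2\frac{1}{6} + \ldots\right)$, $\mathrm{entropy}(c_3) = -\left(\frac{4}{6}\log_2\frac{4}{6} + \ldots\right)$.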

37-39 Mean entropy of a clustering: the average entropy over all clusters in the clustering, weighted by the number of elements in each cluster: $\text{mean entropy}(C) = \sum_{i=1}^{K} \frac{m_i}{m}\,\mathrm{entropy}(c_i)$, where $m_i$ is the number of instances in cluster $c_i$ and $m$ is the total number of instances in the clustering. For the example clusters $c_1$, $c_2$, $c_3$: $\text{mean entropy}(C) = \frac{9}{21}(1.74) + \frac{6}{21}(0.78) + \frac{6}{21}(0.45) = 1.1$.
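A sketch of cluster entropy and size-weighted mean entropy. Each cluster is represented as a list of class labels; this representation, the function names, and the toy labels are assumptions for illustration. The last line only re-checks the weighted average on this slide from the three entropy values it quotes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class labels within one cluster."""
    m_i = len(labels)
    counts = Counter(labels)  # m_ij for each class j present in the cluster
    return -sum((m_ij / m_i) * math.log2(m_ij / m_i) for m_ij in counts.values())

def mean_entropy(clustering):
    """Entropy averaged over clusters, weighted by cluster size m_i / m."""
    m = sum(len(c) for c in clustering)
    return sum(len(c) / m * entropy(c) for c in clustering)

# Hypothetical clustering: one pure cluster and one evenly split cluster.
clustering = [
    ["black", "black", "black", "black"],  # single class: entropy 0
    ["blue", "blue", "green", "green"],    # two equally likely classes: 1 bit
]

# Weighted average using the cluster sizes and entropies quoted on the slide.
slide_mean = (9 * 1.74 + 6 * 0.78 + 6 * 0.45) / 21
```

Lower mean entropy means the clusters line up more closely with the known classes.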

40 What is the mean entropy of this clustering? c 1 c 2 c 3

41 Entropy Exercise

42 Homework 5

43-47 Adapted from Bing Liu, UIC. Issues for K-means: The algorithm is only applicable if the mean is defined. For categorical data, use K-modes: the centroid is represented by the most frequent values. The user needs to specify K. The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. Outliers could be errors in the data recording or some special data points with very different values.
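A sketch of the K-modes idea for categorical data: the centroid takes the most frequent value of each attribute. The mismatch count used as a dissimilarity is a common companion choice and is an assumption here, as are the function names and the example records.

```python
from collections import Counter

def mode_centroid(cluster):
    """Most frequent value of each attribute across the cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

def mismatches(x, y):
    """Number of attributes on which two categorical records differ."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical categorical records: (color, size).
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
center = mode_centroid(cluster)
```

For these records the centroid is ("red", "small"), which need not coincide with every member of the cluster.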

48 Adapted from Bing Liu, UIC Issues for K-means: Problems with outliers CS583, Bing Liu, UIC

49-50 Adapted from Bing Liu, UIC. Dealing with outliers: One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points. This is expensive and not always a good idea! Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small. The rest of the data points are then assigned to the clusters by distance or similarity comparison, or by classification.
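One way the first method could be sketched: flag points whose distance to the centroid is much larger than the typical distance in the cluster. The median-based rule and the `factor=3.0` threshold are assumptions chosen for illustration, not a rule given in the slides.

```python
import math

def distance(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def far_points(cluster, centroid, factor=3.0):
    """Points lying more than `factor` times the median distance from the centroid."""
    dists = [distance(p, centroid) for p in cluster]
    median = sorted(dists)[len(dists) // 2]
    return [p for p, d in zip(cluster, dists) if d > factor * median]

# Hypothetical cluster with one point far from the rest.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
centroid = tuple(sum(col) / len(cluster) for col in zip(*cluster))
outliers = far_points(cluster, centroid)
```

Note that the outlier itself pulls the centroid toward it, which is one reason this kind of filtering is not always a good idea.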

51 Adapted from Bing Liu, UIC. Issues for K-means (cont.): The algorithm is sensitive to initial seeds.

52-54 Adapted from Bing Liu, UIC. Issues for K-means (cont.): If we use different seeds, we get good results. K-means results can often be improved by doing several random restarts. It is often useful to select instances from the data as initial seeds.
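A sketch of random restarts: run K-means several times from different seeds drawn from the data and keep the best run. Scoring runs by sum of squared error is an assumption (the slides do not name the criterion), as are the function names, the restart count, and the toy data.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Per-coordinate mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans_once(points, k, rng, max_iters=100):
    """One K-means run seeded with k instances from the data; returns (sse, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return sse, centroids

def best_of_restarts(points, k, restarts=10, seed=0):
    """Keep the run with the lowest sum of squared error."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])

# Hypothetical data: three small groups.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0), (10.0, 0.0), (10.5, 0.0)]
best_sse, best_centroids = best_of_restarts(points, k=3)
```

The best of several runs can never score worse than the first run alone, at the cost of running the algorithm `restarts` times.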

55 Adapted from Bing Liu, UIC. Issues for K-means (cont.): The K-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

56 Other Issues. What if a cluster is empty? Choose a replacement centroid: at random, or from the cluster that has the highest mean square error. How to choose K? The assigned reading discusses several methods for improving a clustering with postprocessing.

57 Choosing the K in K-Means. Hard problem! Often there is no correct answer for unlabeled data. Many methods have been proposed! Here are a few: (1) Try several values of K and see which is best, via cross-validation. Metrics: mean square error, mean square separation, penalty for too many clusters [why?]. (2) Start with K = 2, then try splitting each cluster. The new means are one sigma away from the cluster center in the direction of greatest variation. Use metrics similar to those above.

58 Elbow method: Plot average MSE vs. K. Choose the K at which MSE (or another metric) stops decreasing abruptly (the "elbow" of the curve). However, sometimes there is no clear elbow.
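One possible way to locate the elbow numerically, assuming the average MSE has already been computed for at least three consecutive values of K: pick the K where the improvement shrinks the most from one step to the next. This largest-bend heuristic and the MSE values below are assumptions for illustration; in practice the plot is often inspected by eye.

```python
def elbow_k(mse_by_k):
    """Return the K with the largest change in slope, or None if fewer than 3 values.

    mse_by_k maps each tried K to the average MSE obtained with that K.
    """
    ks = sorted(mse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = mse_by_k[prev_k] - mse_by_k[k]  # gain from reaching k
        drop_after = mse_by_k[k] - mse_by_k[next_k]   # gain from going past k
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical average MSE values for K = 1..6.
mse_by_k = {1: 100.0, 2: 60.0, 3: 20.0, 4: 17.0, 5: 15.0, 6: 14.0}
chosen_k = elbow_k(mse_by_k)
```

With these values the MSE drops sharply up to K = 3 and only slightly afterwards, so K = 3 is selected. When the curve flattens gradually the bend values are all similar and the choice is not meaningful.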
