k-means Clustering (pp )
Notation: [pencil icon] means pencil-and-paper QUIZ; [keyboard icon] means coding QUIZ.

Explaining the initialization and iterations of the k-means clustering algorithm: let us understand the mechanics of k-means on a 1-dimensional example [1].

This is the random initialization of 2 clusters (k = 2). Show how the points are assigned to the clusters!

[1] Unlike higher-dimensional cases, data points in 1D can be sorted, which makes the clustering problem easier. k-means is not applied to 1D data in real life. In fact, clustering in 1D is usually called by a different name, segmentation. An example of a 1D clustering algorithm is Jenks natural breaks optimization.
Calculate the new cluster centers!
Show how the points are assigned to the new clusters!
Calculate the cluster centers in the next iteration!
Iterate until the centers are stable! (See the sketch below.)
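A minimal sketch of these steps in plain Python, on hypothetical 1D data (the point values and initial centers here are made up for illustration, not the in-class example):

```python
# Minimal 1D k-means sketch (hypothetical data, k = 2).
points = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
centers = [2.5, 4.0]  # an arbitrary "random" initialization

for iteration in range(10):
    # Assignment step: each point goes to its closest center.
    clusters = [[] for _ in centers]
    for p in points:
        distances = [abs(p - c) for c in centers]
        clusters[distances.index(min(distances))].append(p)
    # Update step: each center moves to the mean of its cluster.
    # (Empty-cluster guard omitted for brevity.)
    new_centers = [sum(cl) / len(cl) for cl in clusters]
    if new_centers == centers:   # centers are stable -> stop
        break
    centers = new_centers

print(centers)  # converges to [2.0, 9.0] for this data
```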
Now a similar problem in 2D:
Show how the points are assigned to the clusters!
Calculate the coordinates of the new cluster centers!
Iterate until the centers are stable!
Each cluster lies within its (linear) boundaries, which can later be used for classification.

How to do our own clustering with Scikit-learn, and how to use the trained clustering model to classify new points (the model is not changed by the new points!):
- Use Numpy vectorized operations to prove that the two arrays above are identical.
- Use the model trained above to classify these three data points: (-1, 42), (1, -42), (42, 42). Make only one call to predict( )!

The attribute cluster_centers_ stores the coordinates of the centers. We can print the data points and the centers on the same plot using mglearn:
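A minimal sketch of these steps, assuming a synthetic dataset from make_blobs (the dataset and variable names are placeholders, not the exact ones used in class):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data (placeholder for the in-class dataset).
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Fit the clustering model.
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

# labels_ (from training) and predict(X) agree on the training data;
# a vectorized way to prove the two arrays are identical:
print(np.all(kmeans.labels_ == kmeans.predict(X)))  # True

# Classify new points -- one call to predict(); the model is unchanged.
new_points = np.array([[-1, 42], [1, -42], [42, 42]])
print(kmeans.predict(new_points))

# cluster_centers_ stores the coordinates of the centers.
print(kmeans.cluster_centers_)

# Points and centers on the same plot (mglearn.discrete_scatter does
# this too; plain matplotlib is used here to stay self-contained).
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='^', s=100, c='red')
plt.show()
```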
To do for next time:
- Try more or fewer cluster centers (p. 174).
- Read Failure cases (pp. 175-8) and try out the code therein.
- Read more about shortcomings of k-means here: (linked on our webpage).
- Update memory-sheet!
Solutions:
Algorithm development case-study: Implementing the k-means clustering algorithm [2]

In this section, we are going to apply Python tools to implement and refine the k-means clustering algorithm in 1, 2, and 3 dimensions. Although we already know that k-means is not used in the one-dimensional (1D) case, we start here because it is a simpler problem:

A company wants to distribute goods to 30 cities located on the same highway. To do this, the company will build a number k (initially unknown) of distribution centers along the highway, and each center will serve a number of neighboring cities. The centers themselves need not be in a city; they can be placed anywhere on the highway. What is the optimal placement of the centers?

Whenever we ask for an optimal solution, we need to specify what exactly we want optimized: in general, we need a function whose minimum or maximum must be attained. For this problem, a good candidate is the sum of all distances from centers to the cities they serve [3], but, in order to make the math more tractable, we shall adopt a slightly different function, the sum of the squares of those distances; borrowing a concept from Coding Theory, this is called the distortion:

D = \sum_{i=0}^{k-1} \sum_{j=0}^{n_i - 1} d^2(\text{center}_i, \text{city}_{ij}), where n_i is the number of cities served by center i. (*)

The 1D data is simply the distance of each city from one end of the highway [4]. The instructor will provide the starter code shown here. Note that the set of data points is sorted, a simplification that will not be possible in higher dimensions. We now find the number of points, and decide on the number of centers/clusters.

We have to find initial candidates for the cluster centers. With 1D data, this is an easy task, due to the sorted order: we split the highway into k equal segments, and take the middle of each segment as the initial center. The code is shown below (see the sketch after the footnotes).

[2] This section is not in our text.
[3] Assuming the trucks drop off goods to only one city on each trip.
[4] This is Interstate 90, with Seattle (0 miles) at the Western end and Boston (3053 miles) at the Eastern end; distances were obtained from Google Maps.
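A minimal sketch of this initialization; the mile markers below are placeholders standing in for the starter code's data:

```python
# 1D initialization: split the highway into k equal segments and
# take the middle of each segment as the initial center.
points = sorted([5, 110, 290, 480, 830, 1250, 1700, 2100, 2650, 3053])  # placeholder miles
n = len(points)   # number of data points
k = 3             # number of centers/clusters

highway_length = points[-1] - points[0]
segment = highway_length / k
centers = [points[0] + segment * (i + 0.5) for i in range(k)]
print(centers)    # the middle of each of the k segments
```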
As mentioned, multi-dimensional data cannot be sorted, so we develop an alternate method for initialization: initial centers are chosen at random.

Warm-up problems (solve in a separate .py file): Create a list of 10 random elements, using one of the random generators in Python's random module: random( ), uniform( ), triangular( ). Apply the built-in function max( ), which returns the maximum (largest) element in its argument. How can we also find the index of the maximum? Hint: There are at least two possible solutions, one using the method index( ), another using the built-in function enumerate( ).

One iteration of the k-means algorithm consists of:
- For each data point, find the closest center. Label the data point with an identifier for the center.
- Group the data points into k sets, based on the labels; each set contains points assigned to the same center, the one closest to them. This is classification.
- Find the centroid ("center of mass") for each set. The centroids become the new centers.

Let us implement these steps. In the code above, the centers are numbered 0 to k-1, and these numbers are used as labels. The list closest, of length n, is used to store the number of the closest center. We use the built-in functions abs( ), which
returns the absolute value of a scalar argument, and min( ), which returns the minimum of its list argument. Note that the label in closest is the number of the center, whereas the minimum is calculated as a distance; this is why we use the index( ) method to find where the minimum occurred. The result is that the list closest now contains, for each data point, the number of the closest center, as seen in the output.

Once the points have been classified into sets with respect to the current centers, we calculate the centroid for each set. Centroids are averages, so we need two lists: one for the sums, and one for the counts. (A sketch of both steps follows below.)

It can be proven mathematically that the distortion function D defined in (*) above is reduced (or stays the same) at each iteration, and therefore the k-means algorithm is guaranteed to converge to a minimum. Unfortunately, that may not be the global minimum, as seen by running the algorithm multiple times with different initial centers [5]. There is no known method guaranteed to find the global minimum, short of exhaustive search. In practice, we just run the algorithm multiple times and take the solution with the minimum distortion.

In the output above, the distortion is denoted D_sqr, but we also included D_abs, the sum of only the absolute values of distances to centers; intuitively, we feel that a low value for D_sqr also means a low value for D_abs, but we can see that this is not always the case. In practice, however, the difference between the two distortions is small enough to neglect.

This is how we plot our 30 points on a line, using matplotlib and pyplot (see the sketch below).

[5] This is the more advanced algorithm, with multiple iterations, which will be developed subsequently.
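A minimal sketch of the classification step, the centroid step, and the 1D plot; the points and centers are placeholders standing in for the starter code's data:

```python
import matplotlib.pyplot as plt

points = [5, 110, 290, 480, 830, 1250, 1700, 2100, 2650, 3053]  # placeholder miles
centers = [500.0, 1500.0, 2500.0]
k = len(centers)

# Classification: for each point, store the number of the closest center.
closest = []
for p in points:
    distances = [abs(p - c) for c in centers]        # abs() gives 1D distance
    closest.append(distances.index(min(distances)))  # label = center number
print(closest)

# Centroids are averages: one list for the sums, one for the counts.
sums = [0.0] * k
counts = [0] * k
for p, label in zip(points, closest):
    sums[label] += p
    counts[label] += 1
centers = [sums[i] / counts[i] for i in range(k)]    # (empty-cluster guard omitted)
print(centers)

# Plotting the points and centers on a line:
plt.scatter(points, [0] * len(points))
plt.scatter(centers, [0] * len(centers), marker='x')
plt.show()
```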
Lab 6 - see handout

The starter code for this lab is available on our webpage, and/or can be provided by the instructor. The file is called 16_kmeans_random_centers_1_iteration.py. After the lab is completed, the final code will also be available on our webpage, and/or provided by the instructor. This code will be the starting point for the work in the next session (below).

We start this session from the final version of the code for k-means, developed in the previous lab. If you don't have it, it can be obtained from our webpage, or from the instructor. The file is called 17_kmeans_loop_threshold.py [6]

Discussion of the color list: what happens when we increase the number of clusters to 4? 5? 6? What are we supposed to do to fix the problem? Now revert k back to 3 until the end of today's session.

We mentioned that the algorithm may not converge to the global optimum. Change the random seed used in the program to several different values and see how the final centroids change. The question then is: which of the different solutions obtained above is better? To answer it, we have to first agree on a definition, or metric, for "better". One possible solution is to use the sum of the absolute values, as we did when we measured the change in the centroids' positions. Or we can take the average, by dividing the sum above by the number of clusters. The metric that works best in higher dimensions, however, is the sum of squared distances, a.k.a. distortion, introduced earlier. For convenience, the definition is repeated here:

D = \sum_{i=0}^{k-1} \sum_{j=0}^{n_i - 1} d^2(\text{center}_i, \text{city}_{ij}), where n_i is the number of cities served by center i. (*)

Modify the code to calculate and display the value of D for the solution. Write a Python function to do this (What arguments does this function need?). To distinguish the distortion from the sum of absolute values, name your function D_sqr( ). Modify the code to run in a for loop, with different random seeds (or drop the random.seed( ) call altogether). Keep track of the best solution, i.e. the one with the minimum distortion, and display it at the end. (A sketch follows below.)

[6] If you downloaded it from the webpage, it probably has the extension .txt, which you have to change to .py.
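A sketch of such a function and of the multi-restart loop; run_kmeans( ) here is a hypothetical stand-in for the full lab code, included only so the example runs:

```python
import random

def D_sqr(points, centers, closest):
    """Distortion: sum of squared distances from each point to its center."""
    return sum((p - centers[label]) ** 2 for p, label in zip(points, closest))

def run_kmeans(points, k, iterations=20):
    """Hypothetical stand-in for the full lab code: random init + iterations."""
    centers = random.sample(points, k)  # k distinct random points as centers
    for _ in range(iterations):
        closest = [min(range(k), key=lambda i: abs(p - centers[i])) for p in points]
        for i in range(k):
            members = [p for p, c in zip(points, closest) if c == i]
            if members:                 # empty-cluster guard
                centers[i] = sum(members) / len(members)
    return centers, closest

points = [5, 110, 290, 480, 830, 1250, 1700, 2100, 2650, 3053]  # placeholder miles
k = 3

# Multi-restart loop: keep the solution with minimum distortion.
best = None
for seed in range(20):
    random.seed(seed)
    centers, closest = run_kmeans(points, k)
    d = D_sqr(points, centers, closest)
    if best is None or d < best[0]:
        best = (d, centers, closest)

print("best distortion:", best[0])
print("best centers:", best[1])
```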
Hint: Use a new list that stores closest, centroids, and D_abs, for the current best. At the end, display and plot only the best solution.

Questions for the class:
- What was the minimum D_abs for k = 3 clusters? What are the centroids' positions corresponding to it?
- Do you think that increasing the number of clusters from 3 to, say, 4 or 5 will increase or decrease the distortion? After you have answered this question, experiment and see if you got it right!

We start this session from the final version of the code for k-means, developed previously. If you don't have it, it can be obtained from our webpage, or from the instructor. The file is called 18_kmeans_distortion.py [7]

First, let us make some small improvements:

A. We noticed last time that, especially when working with larger values for the number of centers k, some random choices for the centers resulted in an error condition, but the call to exit(0) didn't work as expected. We fix this by calling the stronger version of exit, from the sys module.

B. The creation of the list colors for plotting can be made more efficient using a dictionary and a list comprehension. (A sketch of both improvements follows below.)

[7] If you downloaded it from the webpage, it probably has the extension .txt, which you have to change to .py.
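A sketch of both improvements; the color choices and the error condition are assumptions, since the original screenshots are not reproduced here:

```python
import sys

# A. The stronger exit: sys.exit() raises SystemExit reliably, even in
# contexts where the built-in exit(0) does not behave as expected.
def check_clusters(counts):
    """Abort if any cluster ended up empty (hypothetical error condition)."""
    if 0 in counts:
        print("Empty cluster - aborting")
        sys.exit(1)

# B. Colors via a dictionary and a list comprehension:
color_dict = {0: 'red', 1: 'green', 2: 'blue', 3: 'orange', 4: 'purple'}
closest = [0, 2, 1, 1, 0, 2]                      # placeholder labels
colors = [color_dict[label] for label in closest]
print(colors)
```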
What happens now if we try to use more than 5 centers? Where else in the program can we use color_dict?

C. Finally, it is possible to include custom text information in matplotlib pyplots, with the method annotate( ), as shown here: The first argument is the text to be displayed, provided as a string. The second is the location (xy coordinates) of the tip of the arrow, provided as a tuple. The third is the location (xy coordinates) where the text starts to be displayed, also as a tuple. The fourth, arrowprops, allows us to customize the properties of the arrows [8]. (A sketch follows below.)

Now we do a major refactoring of the program, by replacing two (potentially) long lists with Numpy arrays:
- points becomes arr_points
- closest becomes arr_closest

Although distances is also of the same size, we are leaving it for the lab! The function d_sqr( ) is also left for the lab!

Make only one small change at a time, and test the program after each change! (Make sure you obtain exactly the same result!) Whenever possible, use Numpy-specific operations!

[8] See
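A minimal sketch of annotate( ); the data, coordinates, and text are placeholders:

```python
import matplotlib.pyplot as plt

plt.scatter([1, 2, 3], [1, 4, 9])
plt.annotate('a point of interest',            # 1st: the text, as a string
             xy=(2, 4),                        # 2nd: tip of the arrow (tuple)
             xytext=(1.2, 7),                  # 3rd: where the text starts (tuple)
             arrowprops={'arrowstyle': '->'})  # 4th: arrow properties
plt.show()
```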
Solutions:

What happens now if we try to use more than 5 centers?
A: We get a KeyError, because the key 5 doesn't exist in the dictionary color_dict.

Where else in the program can we use color_dict?
A: In the scatter plot with the centers, we can use the values in this dictionary:
Lab 7 - see handout

The starter code for this lab is available on our webpage, and/or can be provided by the instructor. The file is called 20_kmeans_NUMPY.py. After the lab is completed, the final code will also be available on our webpage, and/or provided by the instructor.

Failure cases of k-means (pp. 175-8)

A. The number of clusters is a hyper-parameter. There is no universal algorithm to find the best number. Many heuristics exist, like the elbow method, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the deviance information criterion (DIC).

B. Clusters are defined by diameter only; there is no way to account for density. Here is the text example:

C. Clusters are circular; there is no way to account for direction. Here is the text example:
In general, there is no way to account for cluster shape:

D. k-means will find clusters even in non-clustered data [9].

[9] Example from
E. k-means is very sensitive to scale [10].

F. Even on perfect data, k-means can get stuck in a local minimum [11].

[10] Example from
[11] Example from
Relationship between k-means and vector quantization (pp )

PCA and other dimensionality reduction algorithms express the original data points (vectors) as a sum of directions (parts, components) whose number is lower than the original dimensionality. k-means can be viewed as a dimensionality reduction algorithm where each point is represented by just one component: the center of a cluster.

Vector quantization is a technique from signal processing, originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms [12].

The text code on p. 179 shows a comparison between the first few PCA components and the first few clusters obtained with k-means on the dataset Labeled Faces in the Wild. PCA was done with n_components=100, and k-means with n_clusters=100. (Ignore the NMF decomposition, since we didn't cover it.)

And here are a few reconstructions. For k-means, the reconstruction is simply the closest center:
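A sketch of the k-means reconstruction idea (each point replaced by its closest center), on placeholder blob data rather than the faces dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=5, random_state=0)

kmeans = KMeans(n_clusters=5, random_state=0).fit(X)

# Vector-quantization view: the "reconstruction" of each point is
# simply the center of its cluster.
X_reconstructed = kmeans.cluster_centers_[kmeans.labels_]
print(X[0], '->', X_reconstructed[0])
```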
An advantage of k-means over PCA: we are not restricted to the number of dimensions of the original dataset. Example: the Two Moons dataset has only 2 dimensions, so PCA is no help, but k-means can use, say, 10 clusters, which means 10 features:
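A minimal sketch of this idea, assuming make_moons: with 10 clusters, each 2D point can be re-expressed by its distances to the 10 centers, i.e. 10 features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # each point assigned to one of 10 clusters

# transform() gives the distance from each point to every center:
# 200 points x 10 centers -> 10 features per point.
distance_features = kmeans.transform(X)
print(distance_features.shape)   # (200, 10)
```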
Would the moon shapes be preserved with fewer than 10 clusters? Experiment to find out!

Aside from the cluster centers themselves, the transform( ) method of KMeans allows easy access to the distance between each point and each center (as in the sketch above).

Complexity: O(kndi), where i is the number of iterations and d is the number of features (the dimension of a data point).

End k-means. Next, we cover a clustering algorithm that offers support in choosing the right number of clusters.
Agglomerative Clustering (AC) (pp. 184-9)

Pseudocode:
Start by declaring each point its own cluster.
Repeat: merge the two closest clusters,
until a certain objective is reached.

The Big-Oh complexity of AC (not in text)

The nr. of data points is n and the number of features is d. Remark: the complexity depends on the exact way in which the distance between clusters is measured. Since clusters are merged, or linked, the distance is called the linkage. The discussion in our text implies the so-called single linkage, a.k.a. nearest-neighbor distance; it is the minimum distance over all pairs of points in the two clusters.

A relatively straightforward algorithm for single linkage uses an n x n proximity matrix to keep track of the distances between clusters at each step. It can be shown to have time complexity O(n²d + n³) = O(n³). Faster, but more intricate, algorithms exist for single linkage [13] that are only O(n²). Interestingly, the Scikit-learn implementation of AC does not use single linkage; see below for details.
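A didactic sketch of the straightforward approach; for clarity it recomputes single-linkage distances from a precomputed proximity matrix at every merge (the O(n³) version described above avoids this with extra bookkeeping). This illustrates the idea only; it is not the Scikit-learn algorithm:

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with single linkage (illustration only)."""
    n = len(X)
    # n x n proximity matrix of pairwise point distances.
    prox = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]   # each point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the two closest clusters: minimum distance over all point pairs.
        best, best_d = (0, 1), float('inf')
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(prox[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(single_linkage(X, 2))  # [[0, 1, 2, 3], [4]]
```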
Using AC in Scikit-learn

Unlike KMeans, AgglomerativeClustering does not have a predict( ) method, because the clusters do not naturally divide the space among themselves. We can either:
- apply fit( ) and then use the information in the attribute labels_, as with KMeans, or
- apply fit_predict( ) in one step, which returns the cluster labels (see the sketch below).

Nothing prevents us, of course, from using the labels to create our own prediction algorithm. As an example, we can find the centroid of each cluster, and then use the distance of a new point to those centroids, k-means-style.

The two most important hyper-parameters are:
- n_clusters: the number of clusters where the bottom-up process stops. Its default value is 2.
- linkage: how the distance between pairs of clusters is calculated, in order to decide which two will be merged next. Here is the official documentation [14]:

It is also possible to override the blind linkage if we have extra information about the structure of the data. In the parameter connectivity, we can specify a matrix of neighboring data points.
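A minimal sketch of both routes, on placeholder blob data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Route 1: fit, then read labels_ (as with KMeans).
agg = AgglomerativeClustering(n_clusters=3)
agg.fit(X)
print(agg.labels_)

# Route 2: fit_predict in one step returns the same labels.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels)
```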
The dendrogram

It cannot be drawn (yet) in Scikit-learn, but it is easy to use the original SciPy. We can then use the usual matplotlib.pyplot tools to show cuts at various inter-cluster distances.

Note: gca = get current (figure's) axes (or create a new figure).

The vertical segments in the dendrogram show how far apart the clusters are. Intuitively, we want to stop ("cut" the dendrogram) at the largest branches, as the cut lines in the sketch below illustrate.
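A sketch along the lines of the book's example; the cut heights are illustrative, not exact:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=12, random_state=0)

# ward() returns the linkage array encoding the successive merges.
linkage_array = ward(X)
dendrogram(linkage_array)

# Mark cuts at two inter-cluster distances on the current axes.
ax = plt.gca()                               # gca = get current axes
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')   # cut into 2 clusters (illustrative height)
ax.plot(bounds, [4, 4], '--', c='k')         # cut into 3 clusters (illustrative height)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()
```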
Conclusions on AC:
- A little slower than k-means in practice, but still scales well.
- If we manually create a connectivity matrix based on the structure of the dataset, it can capture complex shapes of clusters... but it does not do it out of the box.
- It still has no concept of the density of the data points. (But the next and last clustering algorithm we present below does capture density!)

DBSCAN

The acronym stands for Density-Based Spatial Clustering of Applications with Noise.

Idea: clusters are regions of (relatively) high density of points, separated by regions of (relatively) low density. More detailed idea: identify core samples, then extend the cluster by finding other core samples.

What happened? The special label -1 means "no cluster", i.e. all points are classified as noise! Normal labels start at 0, as usual. We have to tweak two important hyper-parameters: min_samples (default 5) and eps (epsilon, default 0.5):
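A sketch of the situation described above, assuming a small blob dataset: with the defaults, every point can come out labeled -1 (noise); tweaking eps and min_samples recovers clusters:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=12, random_state=0)

# Default hyper-parameters: min_samples=5, eps=0.5.
labels = DBSCAN().fit_predict(X)
print(labels)    # can be all -1: every point classified as noise

# Tweaking the two hyper-parameters:
labels = DBSCAN(min_samples=3, eps=1.5).fit_predict(X)
print(labels)    # normal labels start at 0
```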
Performance on the two-moons dataset:
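A minimal sketch of this experiment, assuming make_moons; as in the book, rescaling the data first makes the default eps reasonable:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Rescale to zero mean and unit variance, so the default eps=0.5 works.
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN().fit_predict(X_scaled)
print(set(labels))   # ideally {0, 1}: the two moons, found without specifying k
```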
Big-Oh complexity: O(n²), but can be brought down to O(n log n).

Conclusions on DBSCAN:

Pros:
- Does not need the nr. of clusters as a hyper-parameter!
- Models density!
- Can find arbitrarily-shaped clusters!
- Robust to outliers: will leave a data point outside of any cluster if it is too far!
- In practice, it is a little slower than k-means and AC, but not by much.

Cons:
- Does not do well if the dataset has regions with large differences in density.
- Choosing a good threshold requires good understanding of the scaling of the data.
- Not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data are processed [15].

Read FYI: Comparing and Evaluating Clustering Algorithms

Homework #4 was assigned
More information