Clustering (Basic Concepts and Algorithms). Entscheidungsunterstützungssysteme (Decision Support Systems)
Why do we need to find similarity? Similarity underlies many data science methods and solutions to business problems. Some examples: A company wants to find companies that are similar to its best business customers, so that the sales staff can treat them as prospects. Modern retailers such as Amazon and Netflix use similarity to recommend similar products, or products liked by similar people. Whenever you see statements like "People who like X also like Y" or "Customers with your browsing history have also looked at...", similarity is being applied. A doctor may reason about a new, difficult case by recalling a similar case (either treated personally or documented in a journal) and its diagnosis.
Similarity and Distance The closer two objects are in the space defined by the features, the more similar they are. Consider two instances of a credit card application: we want to find the similarity between the two cases. So, our objective is to convert three dimensions (age, years at current address, and residential status) into a distance. There are many ways to measure similarity between two objects; the simplest is the Euclidean distance.
Distance measures The Minkowski distance of order p between two points x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) is given as: d(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}. The 1-norm distance is called the Manhattan distance, and the 2-norm distance is the Euclidean distance.
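As a quick sketch of the formula in Python (the sample points are illustrative, not taken from the slides):

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1.0, 2.0), (4.0, 6.0)
print(minkowski(x, y, 1))  # 1-norm (Manhattan): |1-4| + |2-6| = 7.0
print(minkowski(x, y, 2))  # 2-norm (Euclidean): sqrt(9 + 16) = 5.0
```

Setting p = 1 or p = 2 recovers the two special cases named above.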
Similarity and Distance When an object is described by n features, i.e., n dimensions (d_1, d_2, ..., d_n), the general equation for the Euclidean distance in n dimensions is: d(A, B) = sqrt( (d_1,A − d_1,B)² + (d_2,A − d_2,B)² + ... + (d_n,A − d_n,B)² ). Calculated this way, the distance between the two example persons A and B is about 19. This distance is just a number: it has no units and no meaningful interpretation on its own. It is only really useful for comparing the similarity of one pair of instances to that of another pair.
Clustering Clustering is another application of our fundamental notion of similarity. The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.) such that the objects within a group are similar, but objects in different groups are not so similar. For example, revisiting our whiskey example: let's say that we run a small shop in a well-to-do neighborhood, and as part of our business strategy we want to be known as the place to go for single-malt scotch whiskeys. We may not be able to have the largest selection, given our limited space and ability to invest, but we might choose a strategy of having a broad and eclectic collection. If we understood how the single malts grouped by taste, we could (for example) choose from each taste group a popular member and a lesser-known member, or an expensive member and a more affordable member.
Quality: What Is Good Clustering? A good clustering method will produce high-quality clusters: high intra-class similarity (cohesive within clusters) and low inter-class similarity (distinctive between clusters). The quality of a clustering method depends on the similarity measure it uses, its implementation, and its ability to discover some or all of the hidden patterns.
Hierarchical Clustering Consider the figure, in which 6 points have been grouped into clusters based on their similarity, calculated using Euclidean distance. Notice that the only overlap between clusters is when one cluster contains other clusters. Because of this structure, the circles actually represent a hierarchy of clusterings. The most general (highest-level) clustering is the single cluster that contains everything (cluster 5 in the example). The lowest-level clustering is obtained when we remove all the circles, leaving the points themselves as six (trivial) clusters.
Hierarchical Clustering This graph is called a dendrogram, and it shows explicitly the hierarchy of the clusters. Along the x axis are arranged the individual data points; the y axis represents the distance between clusters. At the bottom (y = 0) each point is in a separate cluster. As y increases, different groupings of clusters fall within the distance constraint: first A and C are clustered together, then B and E, then BE with D, and so on, until all clusters are merged. An advantage of hierarchical clustering is that it allows the data analyst to see the groupings before deciding on the number of clusters to extract.
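The merge order that a dendrogram depicts can be reproduced with a small agglomerative-clustering sketch. Single linkage and the toy points below are illustrative assumptions, since the slides do not specify a linkage criterion:

```python
from itertools import combinations

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage(points):
    """Agglomerative clustering with single linkage: repeatedly merge the
    two closest clusters, recording (merge distance, merged cluster).
    The recorded distances are the y-values a dendrogram would show."""
    clusters = [frozenset([i]) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # single linkage: cluster distance = distance of the closest point pair
        d, c1, c2 = min(
            ((min(euclidean(points[i], points[j]) for i in a for j in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0])
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)
        history.append((d, c1 | c2))
    return history

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for d, cluster in single_linkage(pts):
    print(round(d, 2), sorted(cluster))
```

Reading the history bottom-up gives exactly the "first the closest pairs, then larger groupings, until everything is merged" behavior described above.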
Centroid-based clustering The most common way of representing a cluster is through its center, called the centroid. In the figure, we have three clusters whose instances are represented by the circles. Each cluster has a centroid, represented by a solid-lined star. The star is not necessarily one of the instances; it is the geometric center of a group of instances.
K-means clustering The most popular centroid-based clustering algorithm is called k-means clustering. In k-means, the "means" are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster. So, in the previous figure, to compute the centroid of each cluster we would average all the x values of the points in the cluster to form the x coordinate of the centroid, and all the y values to form the y coordinate. The k in k-means is simply the number of clusters one would like to find in the data.
K-means clustering The k-means algorithm starts by creating k initial cluster centers, usually randomly, but sometimes by choosing k of the actual data points, or by being given specific starting points by the user. As shown in the previous figure, the clusters corresponding to these cluster centers are formed by determining which center is closest to each point. Next, for each of these clusters, its center is recalculated by finding the actual centroid of the points in the cluster. The cluster centers typically shift, as shown in the next figure. This procedure keeps iterating until the clusters no longer change.
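The two alternating steps can be sketched in a few lines of Python. The toy data, the fixed seed, and initializing from k sampled data points are illustrative choices, not part of the slides:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: choose k of the data points as initial centers,
    then alternate assignment and centroid-update steps until stable."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its assigned points
        new_centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                       else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # no change: the algorithm has converged
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # one center near each of the two obvious groups
```

On this well-separated toy data the loop settles with one centroid per group after a couple of iterations, mirroring the shifting-centers figure described in the slides.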
K-means clustering Above, the figure on the left shows a data set of 90 points, and the figure on the right shows the final result of clustering after 16 iterations. The three (erratic) lines show the path from each centroid's initial (random) location to its final location.
K-means clustering There is no guarantee that a single run of the k-means algorithm will result in a good clustering. A single run will find a local optimum (a locally best clustering), which depends on the initial centroid locations. For this reason, k-means is usually run many times, starting with different random centroids each time. The results can be compared by examining the clusters, or by a numeric measure such as the clustering's distortion: the sum of the squared differences between each data point and its corresponding centroid. The clustering with the lowest distortion value can be deemed the best clustering.
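Computing distortion and using it to choose between candidate clusterings can be sketched as follows; the two centroid sets stand in for the outcomes of hypothetical k-means runs with different random starts:

```python
def distortion(points, centroids):
    """Sum of squared distances from each point to its nearest centroid;
    lower distortion means a tighter clustering."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
# two candidate centroid sets, as if from k-means runs with different
# random starting points (the values here are illustrative)
good = [(1 / 3, 1 / 3), (28 / 3, 28 / 3)]  # one centroid per true group
bad = [(0.5, 0.5), (5.0, 5.0)]             # one centroid stranded between groups
best = min([good, bad], key=lambda c: distortion(points, c))
print(best is good)  # the lower-distortion run is deemed the better clustering
```

Picking the restart with the lowest distortion is exactly the "run many times, keep the best" procedure the slide describes.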
Determining the value of k A common concern with centroid algorithms such as k-means is how to determine a good value for k. One answer is simply to experiment with different k values and see which ones generate good results. The value of k can be decreased if some clusters are too small and overly specific, and increased if some clusters are too broad and diffuse. For a more objective measure, the analyst can experiment with increasing values of k and graph various metrics of the quality of the resulting clusters. As k increases, the quality metrics should eventually stabilize or plateau: bottoming out if the metric is to be minimized, topping out if maximized.
Determining the Number of Clusters
- Elbow method: use the turning point in the curve of the sum of within-cluster variance with respect to the number of clusters.
- Cross-validation method: divide a given data set into m parts; use m − 1 parts to obtain a clustering model; use the remaining part to test the quality of the clustering. E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set. For any k > 0, repeat this m times, compare the overall quality measures across different values of k, and choose the number of clusters that fits the data best.
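The elbow method can be illustrated with a tiny k-means on toy data. The deterministic initialization (first k points) is a simplification for reproducibility, not how k-means is usually seeded, and the data set is hypothetical:

```python
def kmeans_sse(points, k, iters=100):
    """Run a tiny k-means (seeded deterministically with the first k points)
    and return the within-cluster sum of squared errors (SSE)."""
    centers = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                       else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(sum((a - b) ** 2 for a, b in zip(p, centers[j]))
               for j, c in enumerate(clusters) for p in c)

# three obvious groups of three points each, listed in interleaved order
points = [(0, 0), (9, 9), (0, 9), (1, 0), (10, 9), (1, 9),
          (0, 1), (9, 10), (0, 10)]
for k in range(1, 5):
    print(k, round(kmeans_sse(points, k), 2))
# the SSE drops steeply up to k = 3 (the true number of groups), then
# flattens: the "elbow" at k = 3 is the suggested number of clusters
```

Plotting these SSE values against k would show the turning point the elbow method looks for.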
Understanding the results of Clustering Let us take the whiskey example and suppose the following are two of the clusters formed from the data. Group A, Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa. Group H, Scotches: Bruichladdich, Deanston, Fettercairn, Glenfiddich, Glen Mhor, Glen Spey, Glentauchers, Ladyburn, Tobermory. Thus, to examine the clusters, we can look at the whiskeys in each cluster. Even if we had massive numbers of whiskeys, we could still sample whiskeys from each cluster to show its composition. In this case the names of the data points are meaningful in and of themselves, and convey meaning to an expert in the field. But if we are clustering the customers of a large retailer, a list of the names of the customers in a cluster would probably have little meaning, so this technique for understanding the result of clustering would not be useful.
Understanding the results of Clustering What can we do in cases where we cannot simply show the names of our data points, or where showing the names does not give sufficient understanding? Let's look again at our whiskey clusters, but this time with more information on each cluster.
Group A
o Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa
o The best of its class: Laphroaig (Islay), 10 years, 86 points
o Average characteristics: full gold; fruity, salty; medium; oily, salty, sherry
Group H
o Scotches: Bruichladdich, Deanston, Fettercairn, Glenfiddich, Glen Mhor, Glen Spey, Glentauchers, Ladyburn, Tobermory
o The best of its class: Bruichladdich (Islay), 10 years, 76 points
o Average characteristics: white wine, pale; sweet; smooth, light; sweet, dry, fruity, smoky; dry, light
Understanding the results of Clustering First, in addition to listing the members, an exemplar member is shown: here, the "best of its class" whiskey. These techniques can be especially useful when there are massive numbers of instances in each cluster, so that randomly sampling some may not be as telling as carefully selecting exemplars. The example also illustrates a different way of understanding the result of the clustering: it shows the average characteristics of the members of the cluster; essentially, it shows the cluster centroid. Showing the centroid can be applied to any clustering; whether it is meaningful depends on whether the data values themselves are meaningful.
Requirements and Challenges
- Scalability: clustering all the data instead of only samples
- Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
- Constraint-based clustering: the user may give inputs on constraints; use domain knowledge to determine input parameters
- Interpretability and usability
- Others: discovery of clusters with arbitrary shape; ability to deal with noisy data; incremental clustering and insensitivity to input order; high dimensionality
Strengths and Weaknesses of Each Algorithm
K-means
- Fast and efficient
- Applicable to a continuous n-dimensional space
- Often terminates at a locally optimal solution
- The number of clusters must be specified in advance
- Sensitive to noisy data and outliers
- Weak at clustering non-convex shapes
Hierarchical
- Produces a dendrogram
- Smaller clusters may be generated
- Not very scalable; the distance matrix can be huge to calculate
- Cannot undo what was done previously
Measuring Clustering Quality Two methods: extrinsic vs. intrinsic. Extrinsic: supervised, i.e., the ground truth is available; compare a clustering against the ground truth using a clustering quality measure, e.g., the BCubed precision and recall metrics. Intrinsic: unsupervised, i.e., the ground truth is unavailable; evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are, e.g., the silhouette coefficient.
Silhouette Coefficient For each datum i: a(i) is the average dissimilarity of i to all other data within the same cluster (the smaller the better; indicates cohesiveness). b(i) is the lowest average dissimilarity of i to any cluster of which i is not a member (the cluster achieving this lowest dissimilarity is called the neighboring cluster of i; the larger the better; indicates distinctiveness). Then:
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
     = 1 − a(i)/b(i)   if a(i) < b(i)
     = 0               if a(i) = b(i)
     = b(i)/a(i) − 1   if a(i) > b(i)
Here −1 ≤ s(i) ≤ 1. An s(i) close to 1 indicates that the datum is clustered well; a negative s(i) indicates that it would be more appropriate to assign it to its neighboring cluster. The average of s(i) over all data is therefore an overall measure of how good a clustering result is in terms of both cohesiveness and distinctiveness.
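A minimal sketch of s(i), using the definitions above on hypothetical toy data (singleton clusters, for which a(i) is undefined, are not handled):

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(i, points, labels):
    """Silhouette coefficient s(i) = (b(i) - a(i)) / max{a(i), b(i)}.
    Assumes every cluster has at least two members."""
    # a(i): average dissimilarity to the other members of i's own cluster
    same = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    a = sum(dist(points[i], points[j]) for j in same) / len(same)
    # b(i): lowest average dissimilarity to any cluster i does not belong to
    b = min(
        sum(dist(points[i], points[j])
            for j, l in enumerate(labels) if l == other) / labels.count(other)
        for other in set(labels) if other != labels[i])
    return (b - a) / max(a, b)

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9)]
labels = [0, 0, 0, 1, 1]
print(silhouette(0, points, labels))  # well clustered: close to +1
```

A point placed in the wrong group (e.g. a point near (7, 7) labeled 0 here) gets a negative s(i), signaling that its neighboring cluster would be a better home.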
Sources
F. Provost and T. Fawcett, Data Science for Business.
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques.
P. J. Rousseeuw (1987), "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis," Journal of Computational and Applied Mathematics, 20, 53-65.