Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu

Clustering Overview

Supervised vs. Unsupervised Learning. Supervised learning (classification): supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set. Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Clustering: Partition unlabeled examples into disjoint subsets (clusters) such that examples within a cluster are very similar and examples in different clusters are very different. Clustering is the process of organizing objects into groups whose members are similar in some way. It discovers new categories in an unsupervised manner (no sample category labels provided).

Ch. 16. A data set with a clear cluster structure: how would you design an algorithm for finding the three clusters in this case?

A hierarchy isn't clustering, but it is the kind of output you want from clustering. (Example taxonomy: agriculture, biology, physics, CS, space, ...; with subtopics such as dairy, crops, botany, cell, forestry, agronomy, evolution, magnetism, relativity, AI, HCI, courses, craft, missions.)

Why clustering? A few good reasons: simplification, pattern detection, usefulness in constructing data concepts, and it is an unsupervised learning process.

Where to use clustering? Data mining, information retrieval, text mining, Web analysis, marketing, medical diagnostics, etc. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Land use: identification of areas of similar land use in an earth observation database. Insurance: identifying groups of motor insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continent faults.

Which method should I use? Consider the type of attributes in the data, scalability to larger datasets, ability to work with irregular data, time cost, complexity, data order dependency, and result presentation.

Major existing clustering methods. Distance-based: based on connectivity and density functions. Hierarchical: create a hierarchical decomposition of the set of data (or objects) using some criterion. Partitioning: construct various partitions and then evaluate them by some criterion. Probabilistic: a model is hypothesized for each of the clusters.

Clustering Algorithms. Hierarchical algorithms: bottom-up (agglomerative) and top-down (divisive). Flat algorithms: usually start with a random (partial) partitioning and refine it iteratively; examples are K-means clustering and model-based clustering.

Hard vs. soft clustering. Hard clustering: each sample belongs to exactly one cluster; more common and easier to do. Soft clustering: a sample can belong to more than one cluster; makes more sense for applications like creating browsable hierarchies. You may want to put a pair of sneakers in two clusters, (i) sports apparel and (ii) shoes, and you can only do that with a soft clustering approach.

Clustering Algorithms. Start with a collection of n objects, each represented by a p-dimensional feature vector x_i, i = 1, ..., n. The goal is to divide these n objects into k clusters so that objects within a cluster are more similar than objects between clusters. k is usually unknown. Popular methods: hierarchical, k-means, SOM (Self-Organizing Map), mixture models, etc.

Sec. 16.2. Issues for clustering. Representation for clustering: object representation (vector space? normalization?) and a notion of similarity/distance. How many clusters? Fixed a priori, or completely data driven? Avoid trivial clusters that are too large or too small.

Hierarchical Clustering

Hierarchical Clustering. Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples, e.g., animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean). Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering. Multilevel clustering: level 1 has n clusters, level n has one cluster. Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters: start with singletons and merge clusters. Divisive (partitional, top-down) methods separate all examples immediately into clusters: start with one cluster containing all samples and split clusters.

Hierarchical Clustering: dendrogram and Venn diagram of the clustered data.

Nearest Neighbor Algorithm Nearest Neighbor Algorithm is an agglomerative approach (bottom-up). Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
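
A minimal sketch of this bottom-up merging using SciPy's hierarchical-clustering utilities; the sample points, the Euclidean metric, and the target of k = 3 clusters are illustrative assumptions, not part of the slides.

```python
# Sketch of agglomerative (nearest-neighbor / single-link) clustering with SciPy.
# The data points and k below are hypothetical, chosen only for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.2, 4.8], [9.0, 1.0], [9.1, 0.9]])

# 'single' merges the two clusters whose closest members are nearest,
# i.e. the nearest-neighbor criterion described above.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram so that the desired number of clusters remains.
k = 3
labels = fcluster(Z, t=k, criterion='maxclust')
print(labels)   # each point's cluster id, e.g. [1 1 2 2 3 3]
```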

Nearest Neighbor, Levels 2 through 8: one merge is performed at each level, so the number of clusters decreases from k = 7 down to k = 1.

Hierarchical Clustering. Calculate the similarity between all possible pairs of profiles. The two most similar clusters are grouped together to form a new cluster. Then calculate the similarity between the new cluster and all remaining clusters. Repeat until there is only one cluster.

Cluster Similarity. Assume a similarity function that determines the similarity of two instances: sim(x, y). A similarity measure (or similarity function) is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, similarity measures are usually, in some sense, the inverse of distance metrics. How do we compute the similarity of two clusters, each possibly containing multiple instances? Single link: similarity of the two most similar members. Complete link: similarity of the two least similar members. Group average: average similarity between members.
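
The three criteria can be written directly as functions over pairwise similarities; the sketch below assumes cosine similarity on NumPy vectors, which is an illustrative choice rather than something fixed by the slides.

```python
# Sketch: single-link, complete-link, and group-average similarity
# between two clusters, assuming cosine similarity between instances.
import numpy as np

def cos_sim(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pairwise_sims(ci, cj):
    return [cos_sim(x, y) for x in ci for y in cj]

def single_link(ci, cj):      # similarity of the two most similar members
    return max(pairwise_sims(ci, cj))

def complete_link(ci, cj):    # similarity of the two least similar members
    return min(pairwise_sims(ci, cj))

def group_average(ci, cj):    # average similarity between members
    sims = pairwise_sims(ci, cj)
    return sum(sims) / len(sims)
```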

Clustering: Single Linkage. Dissimilarity between two clusters C1 and C2 = minimum dissimilarity between members of the two clusters. Tends to generate long chains.

Clustering: Complete Linkage. Dissimilarity between two clusters C1 and C2 = maximum dissimilarity between members of the two clusters. Tends to generate compact clumps.

Clustering: Average Linkage. Dissimilarity between two clusters C1 and C2 = average of the distances over all pairs of objects (one from each cluster).

Clustering: Average Group Linkage. Dissimilarity between two clusters C1 and C2 = distance between the two cluster means.

Single Link Agglomerative Clustering. Use the maximum similarity of pairs: sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y). Can result in straggly (long and thin) clusters due to the chaining effect. Appropriate in some domains, such as clustering islands.

Example Combine D and F!

Now, how to calculate new distances?

Complete Link Agglomerative Clustering. Use the minimum similarity of pairs: sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y). Makes tighter, spherical clusters that are typically preferable.

Computational Complexity. In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2). In each of the subsequent n-2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters. In order to maintain overall O(n^2) performance, computing similarity to each other cluster must be done in constant time.

Computing Cluster Similarity. After merging c_i and c_j, the similarity of the resulting cluster to any other cluster c_k can be computed by: Single link: sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k)). Complete link: sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k)).
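
A small sketch of this constant-time update, maintaining a dictionary of cluster-to-cluster similarities; the dictionary layout and function name are assumptions made for illustration.

```python
# Sketch: constant-time similarity update after merging clusters i and j.
# `sim` is assumed to be a dict mapping frozenset({a, b}) -> similarity value
# for every pair of current cluster ids; `clusters` is the set of cluster ids.

def merge_update(sim, clusters, i, j, new_id, linkage='single'):
    combine = max if linkage == 'single' else min  # complete link uses min
    for k in clusters:
        if k in (i, j):
            continue
        s_ik = sim.pop(frozenset((i, k)))
        s_jk = sim.pop(frozenset((j, k)))
        # sim((c_i U c_j), c_k) = max/min(sim(c_i, c_k), sim(c_j, c_k))
        sim[frozenset((new_id, k))] = combine(s_ik, s_jk)
    clusters.discard(i)
    clusters.discard(j)
    clusters.add(new_id)
```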

Group Average Agglomerative Clustering. Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters: sim(c_i, c_j) = (1 / (|c_i ∪ c_j| (|c_i ∪ c_j| - 1))) Σ_{x ∈ (c_i ∪ c_j)} Σ_{y ∈ (c_i ∪ c_j), y ≠ x} sim(x, y). A compromise between single and complete link. The average is taken across all ordered pairs in the merged cluster, instead of unordered pairs between the two clusters, to encourage tight clusters.

Computing Group Average Similarity. Assume cosine similarity and vectors normalized to unit length. Always maintain the sum of vectors in each cluster, s(c_j) = Σ_{x ∈ c_j} x. Then the similarity of two clusters can be computed in constant time: sim(c_i, c_j) = ((s(c_i) + s(c_j)) · (s(c_i) + s(c_j)) - (|c_i| + |c_j|)) / ((|c_i| + |c_j|)(|c_i| + |c_j| - 1)).
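
A quick numerical check of this identity, under the assumption of unit-length NumPy vectors; the three sample vectors are hypothetical.

```python
# Sketch: verify that the vector-sum formula equals the brute-force
# group-average similarity (assuming unit-length vectors).
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

ci = [unit(np.array([1.0, 0.2])), unit(np.array([0.9, 0.4]))]
cj = [unit(np.array([0.1, 1.0]))]

merged = ci + cj
n = len(merged)

# Brute force: average cosine similarity over all ordered pairs x != y.
brute = sum(float(np.dot(x, y)) for x in merged for y in merged
            if y is not x) / (n * (n - 1))

# Constant-time version using the maintained vector sums.
s = sum(ci) + sum(cj)                      # s(c_i) + s(c_j)
fast = (float(np.dot(s, s)) - n) / (n * (n - 1))

print(abs(brute - fast) < 1e-12)           # True
```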

Non-Hierarchical Clustering Typically must provide the number of desired clusters, k. Randomly choose k instances as seeds, one per cluster. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when clustering converges or after a fixed number of iterations.

K-Means. Assumes instances are real-valued vectors. Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c): μ(c) = (1/|c|) Σ_{x ∈ c} x, where |c| is the number of data points in cluster c. Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics. Euclidean distance (L2 norm): L2(x, y) = sqrt(Σ_{i=1}^{m} (x_i - y_i)^2). L1 norm: L1(x, y) = Σ_{i=1}^{m} |x_i - y_i|. Cosine similarity (transform it to a distance by subtracting from 1): 1 - (x · y) / (||x|| ||y||).
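
These three measures written out in NumPy, as a brief illustrative sketch; the two sample vectors are hypothetical.

```python
# Sketch: the three distance measures above, written with NumPy.
import numpy as np

def l2_distance(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))   # Euclidean (L2) distance

def l1_distance(x, y):
    return float(np.sum(np.abs(x - y)))           # L1 (Manhattan) distance

def cosine_distance(x, y):
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(1.0 - cos_sim)                   # 1 - cosine similarity

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(l2_distance(x, y), l1_distance(x, y), cosine_distance(x, y))
```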

K-Means Algorithm. Let d be the distance measure between instances. Select k random instances {s_1, s_2, ..., s_k} as seeds. Until clustering converges (or another stopping criterion is met): for each instance x_i, assign x_i to the cluster c_j such that d(x_i, s_j) is minimal; then update the seeds to the centroid of each cluster, i.e., for each cluster c_j, s_j = μ(c_j) (recalculate centroids).
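
A compact NumPy sketch of this loop; the random seeding, convergence test, and iteration cap are the assumptions spelled out in the comments.

```python
# Sketch: K-means as described above, with random instances as seeds.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    for _ in range(max_iter):
        # Assign each instance to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids
```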

K-Means Example (K=2): pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged!

Sec. 16.4. Termination conditions. Several possibilities, e.g., a fixed number of iterations, the partition is unchanged, or centroid positions don't change.

Sec. 16.4. Convergence. Why should the K-means algorithm ever reach a fixed point, a state in which clusters don't change? K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge. The number of iterations could be large, but in practice it usually isn't.

Time Complexity. Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors. Reassigning clusters: O(kn) distance computations, or O(knm). Computing centroids: each instance vector gets added once to some centroid, O(nm). Assume these two steps are each done once for I iterations: O(Iknm). This is linear in all relevant factors, assuming a fixed number of iterations, and more efficient than O(n^2) HAC.

A simple example showing the implementation of the k-means algorithm (using K=2).

Step 1: Initialization. We randomly choose the following two centroids (k=2) for the two clusters; in this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Step 2: We obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}, and their new centroids are then computed.

Step 3: Using these centroids, we compute the Euclidean distance from each object to each centroid, as shown in the table. The new clusters are therefore {1, 2} and {3, 4, 5, 6, 7}, and the next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Step 4: The clusters obtained are again {1, 2} and {3, 4, 5, 6, 7}, so there is no change in the clusters. The algorithm therefore halts, and the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7}.
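
The original data points are not included in this transcription; the sketch below assumes the seven two-dimensional points commonly used with this worked example (their exact values are an assumption), together with the initial centroids given in Step 1, and it reproduces the centroids m1 = (1.25, 1.5) and m2 = (3.9, 5.1) quoted in Step 3.

```python
# Sketch: re-running the worked example. The seven points below are an
# assumption (the slide's data table is not in the transcription); the
# initial centroids m1=(1.0,1.0) and m2=(5.0,7.0) come from Step 1.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])   # hypothetical points 1..7
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])

for step in range(4):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    clusters = [list(np.where(labels == j)[0] + 1) for j in range(2)]
    print(step + 1, clusters, centroids)

# The run settles on clusters {1,2} and {3,4,5,6,7} with centroids
# (1.25, 1.5) and (3.9, 5.1), matching Steps 3 and 4 above.
```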

Plots: the resulting clusters for K=2, and Steps 1 and 2 of the same example repeated with K=3.

Getting the k right. How do we select k? Try different values of k, looking at the change in the average distance to the centroid as k increases. The average falls rapidly until the right k, then changes little. (Plot: average distance to centroid versus k; the best value of k is at the bend.)
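
A brief sketch of this elbow heuristic, reusing the kmeans() function sketched earlier; the synthetic data and the candidate range of k are illustrative assumptions.

```python
# Sketch: pick k by watching the average distance to the centroid level off.
# Reuses the kmeans() function sketched earlier; X is any (n, m) array.
import numpy as np

def avg_distance_to_centroid(X, k):
    labels, centroids = kmeans(X, k)
    return float(np.mean(np.linalg.norm(X - centroids[labels], axis=1)))

X = np.vstack([np.random.default_rng(1).normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])   # three synthetic blobs

for k in range(1, 7):
    print(k, round(avg_distance_to_centroid(X, k), 3))
# The average drops sharply up to k = 3 (the true number of blobs),
# then improves only a little for larger k.
```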

Example: Picking k. Too few clusters: many long distances to the centroid.

Example: Picking k. Just right: distances are rather short.

Example: Picking k. Too many clusters: little improvement in the average distance.

Strengths of k-means. Simple: easy to understand and to implement. Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm. K-means is the most popular clustering algorithm. Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to complexity.

Weaknesses of k-means. The algorithm is only applicable when the mean is defined; for categorical data, k-modes, where the centroid is represented by the most frequent values, can be used instead. The user needs to specify k. The algorithm is sensitive to outliers: outliers are data points that are very far away from the other data points; they could be errors in the data recording or special data points with very different values.

Weaknesses of k-means: Problems with outliers

Weaknesses of k-means: To deal with outliers. One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points. To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them. Another method is to perform random sampling: since sampling only chooses a small subset of the data points, the chance of selecting an outlier is very small. The rest of the data points can then be assigned to the clusters by distance or similarity comparison, or by classification.

Weaknesses of k-means (cont'd). The algorithm is sensitive to initial seeds.

Weaknesses of k-means (cont'd). If we use different seeds, we can get good results. There are some methods to help choose good seeds.
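
One widely used option for choosing good seeds is k-means++ initialization, available through scikit-learn; this is a general illustration rather than a method named in the slides, and the data below is a placeholder.

```python
# Sketch: one common way to choose good seeds is k-means++ initialization,
# which scikit-learn uses by default; shown here only as an illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))   # placeholder data

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # seeds are spread out before the usual iterations
print(km.inertia_)           # SSE of the final clustering
```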

Weaknesses of k-means (cont'd). The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).

K-means summary. Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency, and because other clustering algorithms have their own lists of weaknesses. There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications. Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!

Cluster Evaluation: a hard problem. The quality of a clustering is very hard to evaluate because we do not know the correct clusters. Some methods are used: user inspection; studying centroids and spreads; rules from a decision tree; for text documents, one can read some of the documents in each cluster.

Cluster evaluation: ground truth. We use some labeled data (for classification), with the assumption that each class is a cluster. After clustering, a confusion matrix is constructed. From the matrix, we compute various measurements: entropy, purity, precision, recall, and F-score. Let the classes in the data D be C = (c_1, c_2, ..., c_k). The clustering method produces k clusters, which divide D into k disjoint subsets D_1, D_2, ..., D_k.
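
A short sketch of purity and entropy computed from cluster and class label arrays; the label values are hypothetical, and the formulas are the standard ones rather than anything specific to these slides.

```python
# Sketch: purity and entropy of a clustering against ground-truth classes.
# `clusters` and `classes` are parallel label arrays (hypothetical values).
import numpy as np

clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # cluster ids D_1, D_2, D_3
classes  = np.array([0, 0, 1, 1, 1, 1, 2, 0])   # class ids c_1, c_2, c_3

def purity(clusters, classes):
    total = 0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        total += np.bincount(members).max()      # size of the majority class
    return total / len(classes)

def entropy(clusters, classes):
    h = 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        p = np.bincount(members) / len(members)
        p = p[p > 0]
        h += (len(members) / len(classes)) * -(p * np.log2(p)).sum()
    return h

print(purity(clusters, classes), entropy(clusters, classes))
```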

About ground truth evaluation. Commonly used to compare different clustering algorithms. A real-life data set for clustering has no class labels, so although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand. However, the fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm. This evaluation method is said to be based on external data or information.

Evaluation based on internal information Intra-cluster cohesion (compactness): Cohesion measures how near the data points in a cluster are to the cluster centroid. Sum of squared error (SSE) is a commonly used measure. Inter-cluster separation (isolation): Separation means that different cluster centroids should be far away from one another. In most applications, expert judgments are still the key.
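
A brief sketch of these two internal measures; the labels and centroids come from the kmeans() sketch above, and the particular formulas (total SSE, minimum pairwise centroid distance) are common illustrative choices rather than definitions from the slides.

```python
# Sketch: intra-cluster cohesion (SSE) and inter-cluster separation,
# computed from the output of the kmeans() sketch above.
import numpy as np
from itertools import combinations

def sse(X, labels, centroids):
    # Sum of squared distances from each point to its cluster centroid.
    return float(np.sum((X - centroids[labels]) ** 2))

def min_separation(centroids):
    # Smallest distance between any two cluster centroids.
    return min(float(np.linalg.norm(a - b))
               for a, b in combinations(centroids, 2))

X = np.random.default_rng(2).normal(size=(60, 2))    # placeholder data
labels, centroids = kmeans(X, 3)
print(sse(X, labels, centroids), min_separation(centroids))
```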

Holes in data space All the clustering algorithms only group data. Clusters only represent one aspect of the knowledge in the data. Another aspect that we have not studied is the holes. A hole is a region in the data space that contains no or few data points. Reasons: insufficient data in certain areas, and/or certain attribute-value combinations are not possible or seldom occur.

Holes are useful too Although clusters are important, holes in the space can be quite useful too. For example, in a disease database we may find that certain symptoms and/or test values do not occur together, or when a certain medicine is used, some test values never go beyond certain ranges. Discovery of such information can be important in medical domains because it could mean the discovery of a cure to a disease or some biological laws.

Data regions and empty regions Given a data space, separate data regions (clusters) and empty regions (holes, with few or no data points). Use a supervised learning technique, i.e., decision tree induction, to separate the two types of regions. Due to the use of a supervised learning method for an unsupervised learning task, an interesting connection is made between the two types of learning paradigms.

Clustering Summary. Clustering has a long history and is still an active area; there are a huge number of clustering algorithms, and more are still coming every year. We only introduced several main algorithms; there are many others, e.g., density-based algorithms, subspace clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc. Clustering is hard to evaluate, but very useful in practice; this partially explains why a large number of clustering algorithms are still being devised every year. Clustering is highly application dependent and, to some extent, subjective.