Unsupervised learning, Clustering CS434
Unsupervised learning and pattern discovery
So far, our data has been in this form: $n$ labeled examples, each with $m$ features, $x^i = (x^i_1, x^i_2, \ldots, x^i_m)$ together with a label $y^i$, for $i = 1, \ldots, n$.
We will be looking at unlabeled data: the same examples $x^1, \ldots, x^n$, but without the labels $y^i$.
What do we expect to learn from such data? I have tons of data (web documents, gene data, etc.) and need to:
organize it better, e.g., find subgroups
understand it better, e.g., understand interrelationships
find regular trends in it, e.g., if A and B, then C
Hierarchical clustering as a way to organize web pages
Finding association patterns in data
Dimension reduction
Clustering
A hypothetical clustering example: [scatter plot of monthly income (horizontal axis) vs. monthly spending (vertical axis); each point represents a person in the database]
A hypothetical clustering example: [the same scatter plot, with the points forming three visible groups]. This suggests there may be three distinct spending patterns among the people in our database.
What is Clustering?
In general, clustering can be viewed as an exploratory procedure for finding interesting subgroups in given data.
A large portion of the effort focuses on a special kind of clustering: group all given examples (i.e., exhaustive) into disjoint clusters (i.e., partitional), such that examples within a cluster are (very) similar and examples in different clusters are (very) different.
Some example applications
Information retrieval: cluster retrieved documents to present more organized and understandable results
Consumer market analysis: cluster consumers into different interest groups so that marketing plans can be designed specifically for each group
Image segmentation: decompose an image into regions with coherent color and texture
Vector quantization for data (e.g., image) compression: group vectors into similar groups, and use the group mean to represent the group members
Computational biology: group genes into co-expressed families based on their expression profiles across different tissue samples and experimental conditions
Image compression: Vector quantization
Group all pixels into self-similar groups; instead of storing all pixel values, store the mean of each group.
[example: original image 701,554 bytes; quantized image 127,292 bytes]
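As an illustrative sketch of this idea (not the slides' actual implementation), the snippet below uses k-means to group pixel colors and keeps only the k group means plus one small index per pixel; the choice of k and the random "pixel" data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize(pixels, k=16):
    """pixels: (num_pixels, 3) array of RGB values.
    Returns the k group means (codebook) and one small group index per pixel."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    return km.cluster_centers_, km.labels_

def reconstruct(codebook, indices):
    # each pixel is replaced by the mean of its group
    return codebook[indices]

# usage sketch with random "pixels"; a real image would be reshaped to (height*width, 3)
pixels = np.random.randint(0, 256, size=(10000, 3)).astype(float)
codebook, idx = quantize(pixels, k=16)
approx = reconstruct(codebook, idx)
print(codebook.shape, idx.shape, approx.shape)   # (16, 3) (10000,) (10000, 3)
```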
Important components in clustering
Distance/similarity measure: how do we measure the similarity between two objects?
Clustering algorithm: how do we find the clusters based on the distance/similarity measure?
Evaluation of results: how do we know that our clustering result is good?
Distance/similarity Measures
One of the most important questions in unsupervised learning, often more important than the choice of clustering algorithm.
What is similarity? "Similarity is hard to define, but we know it when we see it" (quoted from Eamonn Keogh, UCR).
More pragmatically, there are many mathematical definitions of distance/similarity.
Common distance/similarity measures
Euclidean distance (straight-line distance, always >= 0): $d(x, x') = \sqrt{\sum_{i=1}^{m} (x_i - x'_i)^2}$
Manhattan distance (city-block distance, always >= 0): $d(x, x') = \sum_{i=1}^{m} |x_i - x'_i|$
Cosine similarity: the angle between two vectors; commonly used for measuring document similarity
More flexible measures: a weighted distance $d(x, x') = \sqrt{\sum_{i=1}^{m} w_i (x_i - x'_i)^2}$ scales each feature differently using the weights $w_i$; one can learn the appropriate weights given user guidance
Note: we can always transform between distance and similarity using a monotonically decreasing function.
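A minimal sketch of these measures in NumPy (the function names and test vectors are mine, for illustration only):

```python
import numpy as np

def euclidean(x, xp):
    return np.sqrt(np.sum((x - xp) ** 2))            # straight-line distance, >= 0

def manhattan(x, xp):
    return np.sum(np.abs(x - xp))                    # city-block distance, >= 0

def cosine_similarity(x, xp):
    # cosine of the angle between the two vectors (in [-1, 1]); larger means more similar
    return np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

def weighted_euclidean(x, xp, w):
    # scale each feature i by a weight w_i; the weights could be learned from user guidance
    return np.sqrt(np.sum(w * (x - xp) ** 2))

x, xp = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
w = np.array([1.0, 0.5, 2.0])
print(euclidean(x, xp), manhattan(x, xp), cosine_similarity(x, xp), weighted_euclidean(x, xp, w))
```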
How to decide which to use?
It is application dependent. Considering your application domain, you need to ask questions such as: what does it mean for two consumers to be similar to each other? Or for two genes to be similar to each other?
For example, for the text domain we typically use cosine similarity.
Whether this gives you the answer you want depends on your existing knowledge of the domain.
A learning approach
Ideally we'd like to learn a distance function from the user's input, asking users to provide constraints such as "object A is similar to object B, and dissimilar to object C."
For example, if a user wants to group marbles based on their pattern ("they belong together!" vs. "they don't belong together!"), we can learn a distance function that correctly reflects these relationships, e.g., one that weighs pattern-related features more than color features.
This is a more advanced topic that we will not cover in this class, but it is nonetheless very important.
When we cannot afford to learn a distance measure, and don't have a clue about which distance function is appropriate, what should we do?
Clustering is an exploratory procedure, and it is important to explore, i.e., to try different options.
Clustering
Now that we have seen different ways to measure distances or similarities among a set of unlabeled objects, how can we use them to group the objects into clusters?
People have looked at many different approaches, and they can be categorized into two distinct types.
Hierarchical and non-hierarchical Clustering
Hierarchical clustering builds a tree-based hierarchical taxonomy (dendrogram), e.g., animal at the root, splitting into vertebrate (fish, reptile, mammal) and invertebrate (worm, insect).
A non-hierarchical (flat) clustering produces a single partition of the unlabeled data.
Choosing a particular level in the hierarchy produces a flat clustering; recursive application of flat clustering can also produce a hierarchical clustering.
Hierarchical Agglomerative Clustering (HAC)
Assumes a distance function for determining the similarity of two instances. One can also assume a similarity function and reverse some of the operations in the algorithm (e.g., minimum distance -> maximum similarity) to make it work with similarities.
Starts with each object in a separate cluster and then repeatedly joins the two closest clusters until only one cluster is left.
HAC Example: [illustration of eight points A, B, C, D, E, F, G, H being successively merged into larger clusters]
HAC Algorithm
Start with all objects in their own cluster.
Repeat until there is only one cluster:
  Among the current clusters, determine the two clusters, $c_i$ and $c_j$, that are closest.
  Replace $c_i$ and $c_j$ with a single cluster $c_i \cup c_j$.
Problem: we assume a distance/similarity function that computes distance/similarity between examples, but here we also need to compute the distance between clusters. How?
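A minimal sketch of this loop in Python (my own illustrative code; the cluster-distance function is left as a parameter so that the single/complete/average-link options defined on the following slides can be plugged in):

```python
def hac(X, cluster_distance):
    """Naive HAC sketch. X: a list or array of n examples; cluster_distance(X, A, B)
    returns the distance between two clusters given as lists of example indices.
    Returns the sequence of merges as (cluster_a, cluster_b, distance) triples."""
    clusters = [[i] for i in range(len(X))]   # start: every object in its own cluster
    merges = []
    while len(clusters) > 1:
        # among the current clusters, find the two that are closest
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(X, clusters[a], clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        # replace c_i and c_j with the single cluster c_i ∪ c_j
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```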
Distance Between Clusters
Assume a distance function that determines the distance between two objects, $D(x, x')$. There are multiple ways to define a cluster distance function:
Single link: the distance between the two closest members of the clusters
Complete link: the distance between the two furthest members of the clusters
Average link: the average distance over all pairs
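Illustrative implementations of these three options, usable with the `hac` sketch above (assuming Euclidean distance between individual objects; note that this average link averages over cross-cluster pairs, a slightly simpler variant than the merged-cluster formula given a few slides later):

```python
import numpy as np
from itertools import product

def pair_distances(X, A, B):
    # all object-level distances D(x, x') with x in cluster A and x' in cluster B
    return [np.linalg.norm(X[i] - X[j]) for i, j in product(A, B)]

def single_link(X, A, B):
    return min(pair_distances(X, A, B))                 # two closest members

def complete_link(X, A, B):
    return max(pair_distances(X, A, B))                 # two furthest members

def average_link(X, A, B):
    return float(np.mean(pair_distances(X, A, B)))      # average over cross-cluster pairs

# tiny usage example: distance between clusters {0, 1} and {2, 3}
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(single_link(X, [0, 1], [2, 3]),
      complete_link(X, [0, 1], [2, 3]),
      average_link(X, [0, 1], [2, 3]))
```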
Single Link Agglomerative Clustering
Use the minimum distance over all pairs: $D(C_i, C_j) = \min_{x \in C_i,\ x' \in C_j} D(x, x')$
[illustration: points A through H and the clusters produced by single-link merging]
Single link's chaining effect
Single link is famous for its chaining effect: it can gradually add more and more examples to a chain, creating long, straggling clusters.
Complete Link
Use the maximum distance over all pairs: $D(C_i, C_j) = \max_{x \in C_i,\ x' \in C_j} D(x, x')$
[illustration: points A through H and the clusters produced by complete-link merging]
Makes tight, spherical clusters.
Complete link is outlier sensitive
Updating the cluster distances after merging is a piece of cake
After merging $c_i$ and $c_j$, the distance of the resulting cluster to any other cluster $c_k$ can be computed by:
Single link: $D(c_i \cup c_j, c_k) = \min(D(c_i, c_k), D(c_j, c_k))$
Complete link: $D(c_i \cup c_j, c_k) = \max(D(c_i, c_k), D(c_j, c_k))$
This takes constant time given the previous distances.
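A sketch of this constant-time update (illustrative only; the dictionary representation of the pairwise cluster distances is my own assumption):

```python
def update_after_merge(dist, cluster_ids, i, j, linkage="single"):
    """dist: dict mapping frozenset({a, b}) -> D(c_a, c_b) for the current clusters.
    After merging clusters i and j, compute the distance from the new cluster to every
    other cluster k in constant time per k, using only the previously stored distances."""
    new_id = max(cluster_ids) + 1                       # id for the merged cluster c_i ∪ c_j
    for k in cluster_ids:
        if k in (i, j):
            continue
        d_ik = dist[frozenset((i, k))]
        d_jk = dist[frozenset((j, k))]
        if linkage == "single":
            dist[frozenset((new_id, k))] = min(d_ik, d_jk)
        else:                                           # complete link
            dist[frozenset((new_id, k))] = max(d_ik, d_jk)
    return new_id

# usage sketch: clusters {0, 1, 2}; merge 0 and 1 into a new cluster (id 3)
dist = {frozenset((0, 1)): 1.0, frozenset((0, 2)): 4.0, frozenset((1, 2)): 3.0}
new_id = update_after_merge(dist, {0, 1, 2}, 0, 1, linkage="single")
print(dist[frozenset((new_id, 2))])                     # min(4.0, 3.0) = 3.0
```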
Average Link
Basic idea: the average similarity between members of the two clusters, averaged across all ordered pairs in the merged cluster:
$D(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \ \sum_{x' \in c_i \cup c_j,\ x' \neq x} D(x, x')$
Compared to single link and complete link, it is computationally more expensive: naively it can take $O(n^2)$ time to compute the new distance between a pair of clusters, although if we use cosine similarity we can compute the new distance in constant time.
It achieves a compromise between single and complete link.
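A hedged sketch of the constant-time trick alluded to here: if every vector is length-normalized and each cluster stores only its sum vector and size, the average cosine similarity over all ordered pairs of the merged cluster can be computed from those summaries alone (the helper names below are mine):

```python
import numpy as np

def average_link_cosine(sum_i, n_i, sum_j, n_j):
    """Average cosine similarity over all ordered pairs (x, x'), x != x', in the merged
    cluster, assuming every vector has been length-normalized. Only the cluster sum
    vectors and sizes are needed, so the update takes constant time."""
    s = sum_i + sum_j                       # sum vector of the merged cluster
    n = n_i + n_j
    # the sum of x . x' over all ordered pairs equals s . s minus the n self-pairs (each = 1)
    return (np.dot(s, s) - n) / (n * (n - 1))

def normalize(v):
    return v / np.linalg.norm(v)

# usage sketch with two tiny clusters of normalized "document" vectors
c1 = [normalize(np.array([1.0, 0.0])), normalize(np.array([1.0, 0.2]))]
c2 = [normalize(np.array([0.0, 1.0]))]
print(average_link_cosine(sum(c1), len(c1), sum(c2), len(c2)))
```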
HAC creates a Dendrogram
The dendrogram draws the tree such that the height of a tree branch equals the distance $D(C_1, C_2)$ between the two clusters merged at that particular step.
The merge distances are always monotonically increasing.
How many clusters?
Cutting the dendrogram at a specific similarity level gives us a simple flat clustering; different cutting places lead to different numbers of clusters.
One option is to cut the dendrogram where the gap between two successive combination similarities is largest.
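In practice this exploration is easy to do with SciPy's hierarchical clustering routines; a small illustrative example (the toy data and the choice of average linkage are assumptions), including the largest-gap cut mentioned above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two obvious groups in the plane
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

Z = linkage(X, method="average")           # also try "single" or "complete"

# cut where the gap between successive merge distances is largest
merge_dists = Z[:, 2]                      # third column of Z holds the merge distances
gaps = np.diff(merge_dists)
i = np.argmax(gaps)
cut = (merge_dists[i] + merge_dists[i + 1]) / 2.0
labels = fcluster(Z, t=cut, criterion="distance")
print(labels)                              # e.g., [1 1 1 2 2 2]
```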
Comments on HAC
HAC is a convenient tool that often provides interesting views of a dataset; primarily, it can be viewed as an intuitively appealing clustering procedure for data analysis and exploration.
We can create clusterings of different granularity by stopping at different levels of the dendrogram, and HAC is often used together with visualization of the dendrogram to decide how many clusters exist in the data.
Different linkage methods (single, complete, and average) often lead to different solutions.