Clustering algorithms


Machine Learning. Hamid Beigy, Sharif University of Technology, Fall 1393.

Table of contents
1. Supervised & unsupervised learning
2. Clustering
3. Hierarchical clustering
4. Non-hierarchical clustering

Supervised & unsupervised learning
The learning methods covered in class up to this point have focused on classification and regression. A training example consisted of a pair of variables (x, t), where x is a feature vector and t is the label/value. Such learning problems are called supervised, since the system is given both the feature vector and the correct answer.
We now investigate methods that operate on unlabeled data. Given a collection of feature vectors X = {x_1, x_2, ..., x_N} without labels/values t_i, these methods attempt to build a model that captures the structure of the data. These methods are called unsupervised, since they are not provided with the correct answer.
Although unsupervised learning methods may appear to have limited capabilities, several reasons make them useful:
Labeling large data sets can be a costly procedure, while raw data is cheap.
Class labels may not be known beforehand.
Large data sets can be compressed by finding a small set of prototypes.
One can train with a large amount of unlabeled data and then use supervision to label the groupings found.
Unsupervised methods can be used for feature extraction.
Exploratory data analysis can provide insight into the nature or structure of the data.

Unsupervised Learning
Unsupervised learning algorithms fall into two groups:
Non-parametric methods: these methods make no assumption about the underlying densities; instead, we seek a partition of the data into clusters.
Parametric methods: these methods model the underlying class-conditional densities with a mixture of parametric densities, and the objective is to find the model parameters:
$p(x \mid \theta) = \sum_i p(x \mid \omega_i, \theta_i) \, P(\omega_i)$
Examples of unsupervised learning: dimensionality reduction, latent variable learning, clustering.
A cluster is a number of similar objects collected or grouped together. A clustering algorithm partitions examples into groups when no labels are available. Sample applications include novelty detection and outlier detection.
Clusters are connected regions of a multidimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

Applications of Clustering
Clustering retrieved documents to present more organized and understandable results to the user (diversified retrieval).
Detecting near-duplicates, such as entity resolution.
Exploratory data analysis.
Automated (or semi-automated) creation of taxonomies.
Comparison.

Why do Unsupervised Learning?
Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes.
How many clusters do you see in the figure?

Why do Unsupervised Learning? (cont.)
How many clusters do you see in the figure? (The question is repeated for several different example figures.)

Clustering
Clustering algorithms can be divided into several groups:
Exclusive (each pattern belongs to only one cluster) vs. non-exclusive (each pattern can be assigned to several clusters).
Hierarchical (nested sequence of partitions) vs. partitional (a single partition).
Families of clustering algorithms:
Hierarchical clustering
Centroid-based clustering
Distribution-based clustering
Density-based clustering
Grid-based clustering
Constraint-based clustering

Clustering
Challenges in clustering:
Selection of an appropriate measure of similarity to define clusters; this is often both data dependent (cluster shape) and context dependent.
Choice of the criterion function to be optimized (evaluation function).
Choice of the optimization method.
Similarity/distance measures:
Euclidean distance ($L_2$ norm): $L_2(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$
$L_1$ norm: $L_1(x, y) = \sum_{i=1}^{N} |x_i - y_i|$
Cosine similarity: $\text{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}$
Evaluation function: assigns a (usually real-valued) value to a clustering; it is typically a function of within-cluster similarity and between-cluster dissimilarity.
Optimization method: find a clustering that maximizes the criterion. This can be done by global optimization methods (often intractable), greedy search methods, or approximation algorithms.
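These measures are straightforward to compute. Below is a minimal sketch in Python (NumPy assumed); the function names are illustrative and not part of the slides:

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: square root of the summed squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: summed absolute coordinate differences.
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.0])
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```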

Hierarchical clustering
Organizes the clusters in a hierarchical way and produces a rooted tree (dendrogram).
Example taxonomy: Animal splits into Vertebrate (Fish, Reptile, Amphibian, Mammal) and Invertebrate (Worm, Insect, Crustacean).
Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Hierarchical clustering (cont.)
Organizes the clusters in a hierarchical way and produces a rooted binary tree (dendrogram).
Types of hierarchical clustering:
Agglomerative (bottom-up): methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
Divisive (top-down): methods recursively separate all examples into smaller and smaller clusters.

Agglomerative (bottom-up)
Assumes a similarity function for determining the similarity of two clusters.
Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar, until there is only one cluster. The history of merging forms a binary tree or hierarchy.
Basic algorithm:
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two clusters c_i and c_j that are most similar.
Replace c_i and c_j with a single cluster c_i ∪ c_j.
Cluster similarity: how do we compute the similarity of two clusters, each possibly containing multiple instances?
Single linkage: similarity of the two most similar members.
Complete linkage: similarity of the two least similar members.
Group average: average similarity between members. This method uses the average similarity across all pairs within the merged cluster to measure the similarity of two clusters, and is a compromise between single and complete linkage.
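To make the basic algorithm concrete, here is a minimal, unoptimized sketch in Python (NumPy assumed; the helper names and the `linkage` argument are illustrative, not from the slides). It recomputes pairwise cluster similarities naively, so it costs roughly O(n^3) rather than the O(n^2 log n) discussed on a later slide:

```python
import numpy as np

def cluster_similarity(a, b, sim, linkage="single"):
    # a, b: lists of point indices; sim: precomputed pairwise similarity matrix.
    pair_sims = [sim[i, j] for i in a for j in b]
    if linkage == "single":      # similarity of the two most similar members
        return max(pair_sims)
    if linkage == "complete":    # similarity of the two least similar members
        return min(pair_sims)
    return sum(pair_sims) / len(pair_sims)   # group average

def agglomerative(points, linkage="single"):
    # Use negative Euclidean distance as the pointwise similarity.
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    sim = -np.sqrt((diffs ** 2).sum(axis=-1))
    clusters = [[i] for i in range(n)]
    merges = []                               # history of merges (the dendrogram)
    while len(clusters) > 1:
        # Find the pair of current clusters with the highest similarity.
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_similarity(clusters[ab[0]],
                                                     clusters[ab[1]], sim, linkage))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(agglomerative(X, linkage="complete"))
```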

Single-Link (bottom-up)
$\text{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \text{sim}(x, y)$

Complete-Link (bottom-up)
$\text{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \text{sim}(x, y)$

Computational Complexity of HAC
In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n^2).
In each of the subsequent O(n) merging iterations, we must find the smallest-distance pair of clusters; maintaining a heap of candidate pairs gives O(n^2 log n) overall.
In each of the subsequent O(n) merging iterations, we must also compute the distance between the most recently created cluster and all other existing clusters. Can this be done in constant time, so that the overall cost remains O(n^2 log n)?

Centroid-Based Clustering
Assumes instances are real-valued vectors.
Clusters are represented via centroids, for example the average of the points in a cluster:
$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$
Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means algorithm
Input: k = number of clusters, distance measure d.
Select k random instances s_1, s_2, ..., s_k as seeds.
Until the clustering converges or another stopping criterion is met:
For each instance x_i: assign x_i to the cluster c_j such that d(x_i, s_j) is minimum.
For each cluster c_j: update its centroid, s_j = µ(c_j).
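A minimal sketch of this algorithm in Python (NumPy assumed; the function name `kmeans` and its arguments are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = rng or np.random.default_rng(0)
    # Select k random instances as the initial seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(X, k=2)
print(centroids)
```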

Time Complexity
Assume computing the distance between two instances is O(D), where D is the dimensionality of the vectors.
Reassigning clusters for N points: O(kN) distance computations, i.e. O(kND).
Computing centroids: each instance gets added once to some centroid, i.e. O(ND).
Assume these two steps are each done once in each of m iterations: O(mkND) total.
Problems with K-means
Results can vary based on random seed selection, especially for high-dimensional data.
Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
Sensitive to outliers.
Idea: combine HAC and K-means clustering.
Convergence of K-means
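A standard mitigation for seed sensitivity (not covered in the slides) is to run K-means several times with different random seeds and keep the clustering with the lowest within-cluster sum of squared distances. A short sketch, reusing the illustrative `kmeans` function and data X from the previous example:

```python
import numpy as np

def within_cluster_sse(X, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid.
    return float(((X - centroids[labels]) ** 2).sum())

best = None
for seed in range(10):
    centroids, labels = kmeans(X, k=2, rng=np.random.default_rng(seed))
    sse = within_cluster_sse(X, centroids, labels)
    if best is None or sse < best[0]:
        best = (sse, centroids, labels)
print("best SSE:", best[0])
```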

Gaussian mixture model
A mixture model is a linear combination of K densities:
$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
The set of parameters is $\theta = \{\{\pi_k\}, \{\mu_k\}, \{\Sigma_k\}\}$.
$\pi$ is a discrete distribution, i.e. $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$.
Each component is a multivariate Gaussian:
$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$
To generate a sample x from the mixture model: (1) sample a mixture component $z \sim \pi$, (2) sample $x \in \mathbb{R}^D$ from the z-th component, $x \sim \mathcal{N}(\mu_z, \Sigma_z)$.
An alternative viewpoint: z is a 1-of-K binary vector. Then
$p(x) = \sum_z p(x \mid z) \, p(z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
and the posterior distribution is
$p(z_k \mid x) = \frac{p(x \mid z_k) \, p(z_k)}{p(x)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$
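A minimal sketch of this two-step generative process (sample a component, then sample from that component's Gaussian), assuming NumPy; the parameter values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component mixture in D = 2 dimensions.
pis    = np.array([0.3, 0.7])                    # mixing coefficients, sum to 1
mus    = np.array([[0.0, 0.0], [4.0, 4.0]])      # component means
sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

def sample_gmm(n):
    # Step 1: sample the component index z ~ pi for each draw.
    zs = rng.choice(len(pis), size=n, p=pis)
    # Step 2: sample x ~ N(mu_z, Sigma_z) from the chosen component.
    xs = np.array([rng.multivariate_normal(mus[z], sigmas[z]) for z in zs])
    return xs, zs

X, zs = sample_gmm(500)
print(X.shape, np.bincount(zs))
```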

Gaussian Mixtures and EM
Initialize $\pi$, $\mu$, and $\Sigma$. Repeat until convergence:
E-step: evaluate the posterior probabilities
$p(z_k \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
M-step: update the parameter values
$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} p(z_k \mid x_n) \, x_n$
$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} p(z_k \mid x_n) (x_n - \mu_k)(x_n - \mu_k)^T$
$\pi_k = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} p(z_k \mid x_n)$
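A compact sketch of this EM loop in Python, assuming NumPy and SciPy's multivariate normal density (function and variable names are illustrative, and a small diagonal term is added to each covariance for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, rng=None):
    rng = rng or np.random.default_rng(0)
    N, D = X.shape
    # Initialize pi, mu, Sigma.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities p(z_k | x_n) for every point and component.
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], sigmas[k])
                                for k in range(K)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mu_k, Sigma_k, pi_k using the responsibilities.
        Nk = resp.sum(axis=0)
        mus = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
    return pis, mus, sigmas

pis, mus, sigmas = em_gmm(X, K=2)   # X from the sampling sketch above
print(pis)
print(mus)
```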