Center of Atmospheric Sciences, UNAM November 16, 2016
Cluster Analysis
Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other, under some chosen measure, than to those in other groups (clusters). Cluster analysis is used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
https://en.wikipedia.org/wiki/cluster_analysis
Types of cluster models
Typical cluster models include hierarchical clustering, which is based on distance connectivity:
- Agglomerative (bottom up). Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive (top down). All observations start in one cluster, and clusters are split recursively as one moves down the hierarchy.
https://en.wikipedia.org/wiki/_clustering
https://upload.wikimedia.org/wikibooks/en/2/28/agglomerative_clustering_dendogram.png
Types of cluster models
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. Some commonly used metrics are:
- Euclidean distance. $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
- Squared Euclidean distance. $\|a - b\|_2^2 = \sum_i (a_i - b_i)^2$
- Manhattan distance. $\|a - b\|_1 = \sum_i |a_i - b_i|$
- Maximum distance. $\|a - b\|_\infty = \max_i |a_i - b_i|$
- Mahalanobis distance. $\sqrt{(a - b)^T S^{-1} (a - b)}$, where $S$ is the covariance matrix.
https://en.wikipedia.org/wiki/_clustering
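The first four metrics above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names and the sample points p and q are assumptions made for this sketch, and the Mahalanobis distance is left out because it needs a covariance matrix and a matrix inverse.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same as above, without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical sample points, chosen only for illustration.
p = [0, 0]
q = [3, 4]
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25
print(manhattan(p, q))          # 7
print(maximum(p, q))            # 4
```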
Linkage criteria
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria are:
- Maximum or complete-linkage. $\max\{d(a, b) : a \in A, b \in B\}$
- Minimum or single-linkage. $\min\{d(a, b) : a \in A, b \in B\}$
- Mean or average-linkage. $\frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
- Centroid linkage. $\|c_a - c_b\|$, where $c_a$ and $c_b$ are the centroids of clusters $A$ and $B$ respectively.
https://en.wikipedia.org/wiki/_clustering
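As a rough sketch, each linkage criterion above can be written as a function of two clusters A and B and a pairwise distance d. The function names and the two sample clusters are assumptions made for illustration. The centroid version here applies the chosen distance d to the two centroids, which generalizes the norm written on the slide.

```python
def manhattan(a, b):
    # Pairwise distance used in this sketch; any metric d(a, b) would do.
    return sum(abs(x - y) for x, y in zip(a, b))

def complete_linkage(A, B, d):
    # Largest pairwise distance between the two clusters.
    return max(d(a, b) for a in A for b in B)

def single_linkage(A, B, d):
    # Smallest pairwise distance between the two clusters.
    return min(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    # Mean of all |A| * |B| pairwise distances.
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(points):
    # Coordinate-wise mean of a cluster.
    return [sum(col) / len(points) for col in zip(*points)]

def centroid_linkage(A, B, d):
    # Distance between the two cluster centroids.
    return d(centroid(A), centroid(B))

# Hypothetical clusters, chosen only for illustration.
A = [[0, 0], [0, 1]]
B = [[4, 2], [10, 3]]
print(complete_linkage(A, B, manhattan))  # 13
print(single_linkage(A, B, manhattan))    # 5
print(average_linkage(A, B, manhattan))   # 9.0
print(centroid_linkage(A, B, manhattan))  # 9.0
```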
Example
Using complete linkage and the Manhattan distance, perform cluster analysis on the following data points:
a = [0,0] b = [0,1] c = [10,3] d = [4,2]
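One way to check the example by machine is a small agglomerative loop that repeatedly merges the two closest clusters. This is a minimal sketch that assumes clusters are stored as tuples of point names; the helper names (agglomerate, complete_linkage) are hypothetical and not taken from any library.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def complete_linkage(A, B, points, d):
    # A and B are tuples of point names; take the largest pairwise distance.
    return max(d(points[a], points[b]) for a in A for b in B)

def agglomerate(points, linkage, d):
    # Start with every observation in its own cluster.
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j], points, d)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = {"a": (0, 0), "b": (0, 1), "c": (10, 3), "d": (4, 2)}
for left, right, dist in agglomerate(points, complete_linkage, manhattan):
    print(left, "+", right, "at distance", dist)
```

With these four points the merges happen at distances 1, 6, and 13: first a with b, then d with {a, b}, and finally c with the rest. These are the heights at which the dendrogram joins.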
Exercise
Using single linkage and the maximum distance, perform cluster analysis on the following data points:
a = [0,0] b = [0,1] c = [10,3] d = [4,2]
Types of cluster models
Other types of cluster models:
- Centroid models.
- Distribution models. Clusters are modeled using statistical distributions.
- Density models. Clusters are defined as connected dense regions. Examples: DBSCAN and OPTICS.
- Graph-based models. Clusters are represented as subsets of nodes in a graph; connected nodes belong to the same cluster.
Types of clustering:
- Hard clustering. Each object either belongs to a cluster or does not.
- Fuzzy clustering. Each object belongs to each cluster to a certain degree.
k-means clustering
k-means clustering (James MacQueen, 1967) aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells.
Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into $k \le n$ sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares. In other words, its objective is to find:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2 \qquad (1)$$
where $\mu_i$ is the mean of the points in $S_i$.
https://en.wikipedia.org/wiki/_clustering
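To make the objective in (1) concrete, the sketch below evaluates the within-cluster sum of squares for a given partition. The function name wcss and the two sample partitions are assumptions made for illustration. The code only scores a partition; it does not search for the best one.

```python
def mean(points):
    # Coordinate-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def wcss(partition):
    # Sum over clusters of squared Euclidean distances to the cluster mean.
    total = 0.0
    for cluster in partition:
        mu = mean(cluster)
        for x in cluster:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return total

# Two hypothetical partitions of the same four points.
partition_1 = [[(0, 0), (0, 1)], [(10, 3), (4, 2)]]
partition_2 = [[(0, 0), (0, 1), (4, 2)], [(10, 3)]]
print(wcss(partition_1))  # 19.0
print(wcss(partition_2))  # smaller, so this partition is better under (1)
```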
k-means algorithm
1. Randomly initialize k cluster centers $\mu_i$.
2. Assign each observation to its closest cluster:
$$S_i^{(t)} = \{ x_p : \|x_p - \mu_i^{(t)}\|^2 \le \|x_p - \mu_j^{(t)}\|^2 \ \forall j, 1 \le j \le k \} \qquad (2)$$
3. Update each cluster center to the mean of the observations belonging to that cluster:
$$\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j \qquad (3)$$
4. Repeat steps 2 and 3 until the positions of the centers no longer change, or the change falls below a small tolerance.
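The four steps above can be sketched in plain Python as follows. Several choices here are assumptions and not part of the slides: the initial centers are drawn from the observations themselves, an empty cluster keeps its previous center, and the parameters max_iter, tol, and seed are hypothetical names introduced for this sketch.

```python
import random

def squared_distance(x, mu):
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))

def mean(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(observations, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct observations as initial centers (an assumption).
    centers = rng.sample(observations, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest center, as in (2).
        clusters = [[] for _ in range(k)]
        for x in observations:
            nearest = min(range(k), key=lambda i: squared_distance(x, centers[i]))
            clusters[nearest].append(x)
        # Step 3: move each center to the mean of its cluster, as in (3).
        # An empty cluster keeps its old center (an assumption).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 4: stop once no center moves by more than the tolerance.
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, clusters

# Hypothetical data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
print(sorted(centers))
print(clusters)
```

k-means only finds a local minimum of (1), so the result can depend on the initial centers; running it with several seeds and keeping the partition with the lowest within-cluster sum of squares is a common remedy.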
Example
Program k-means in Python! Yeah!