SGN-41006 (4 cr) Chapter 11: Clustering
Jussi Tohka & Jari Niemi
Department of Signal Processing, Tampere University of Technology
February 25, 2014
J. Tohka & J. Niemi (TUT-SGN) SGN-41006 (4 cr) Chapter 11 February 25, 2014 1 / 32
Contents of This Lecture
1 Hierarchical Clustering
2 Quick Partitions
3 Mixture Models
4 Sum-of-Squares
5 Spectral Clustering
6 Cluster Validity
7 Other Unsupervised Schemes
Material
Chapter 11 in WebCop:2011 and Section 14.10 in HasTibFri:2009.
What Should You Already Know?
Basics of hierarchical clustering, k-means, mixture models, and SOM, depending on the basic course you've taken.
Clustering
Given a set of points, divide them into c clusters based on their similarity; the number of clusters c may or may not be known in advance. Different definitions of similarity give rise to different clustering algorithms.
More formally, given a set of points $D = \{x_1, \ldots, x_n\}$, the task is to place each of these into one of c classes, i.e., to find c sets $D_1, \ldots, D_c$ so that
$$\bigcup_{i=1}^{c} D_i = D \quad \text{and} \quad D_i \cap D_j = \emptyset \ \text{for all}\ i \neq j.$$
Hierarchical Clustering
Hierarchical methods:
- Derive a clustering from a given dissimilarity matrix
- Provide a means for summarizing data structure via dendrograms
- Divisive or agglomerative (the latter is much more common)
- Agglomerative clustering works by merging the two closest clusters, starting from n singleton clusters
Hierarchical Methods 1 / 2
Different definitions of "closest" yield different algorithms:
- Single-link, or nearest neighbour
- Complete-link, or farthest neighbour
- Average-link
- Ward's method
Hierarchical Methods 2 / 2
Single-link seeks isolated clusters but suffers from a chaining effect that is usually undesirable. Complete-link, average-link, and Ward's method concentrate on internal cohesion, producing compact and often spherical clusters. A mathematical definition of clustering quality is difficult.
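As a concrete illustration of the agglomerative merging loop, here is a minimal pure-NumPy sketch of single-link clustering. The function name `single_link` and the brute-force pairwise search are illustrative choices made for readability, not the course's reference implementation; library routines (e.g. SciPy's `linkage`) would be used in practice.

```python
import numpy as np

def single_link(X, c):
    """Agglomerative clustering: start from n singleton clusters and
    repeatedly merge the two clusters whose closest members are nearest
    (single-link distance), until c clusters remain."""
    clusters = [[i] for i in range(len(X))]
    # pairwise point-to-point distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > c:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()  # single-link
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters

# Two well-separated 1-D groups are recovered as two clusters.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
print(sorted(map(sorted, single_link(X, 2))))  # [[0, 1, 2], [3, 4]]
```

Replacing the `.min()` with `.max()` would give complete-link, and `.mean()` average-link, which is exactly how the variants above differ.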
Quick Partitions
Methods for obtaining initial partitions of the data:
- Random k selection
- Variable division
- Leader algorithm
Mixture Models
A mixture density:
$$p(x) = \sum_{j=1}^{c} \pi_j \, p(x \mid \theta_j). \quad (1)$$
The priors $\pi_j$ in mixture densities are called mixing parameters, and the class-conditional densities $p(x \mid \theta_j)$ are called component densities. Most commonly the component densities are normal.
EM Algorithm for Gaussian Mixture Models
1. Initialize $\mu_j^0, \Sigma_j^0, \pi_j^0$, and set $t \leftarrow 0$.
2. (E-step) Compute the posterior probabilities
$$p_{ij}^{t+1} = \frac{\pi_j^t \, p_{\mathrm{normal}}(x_i \mid \mu_j^t, \Sigma_j^t)}{\sum_{k=1}^{c} \pi_k^t \, p_{\mathrm{normal}}(x_i \mid \mu_k^t, \Sigma_k^t)}.$$
3. (M-step) Update the parameter values
$$\pi_j^{t+1} = (1/n) \sum_i p_{ij}^{t+1}, \quad (2)$$
$$\mu_j^{t+1} = \frac{\sum_i p_{ij}^{t+1} x_i}{\sum_i p_{ij}^{t+1}}, \quad (3)$$
$$\Sigma_j^{t+1} = \frac{\sum_i p_{ij}^{t+1} (x_i - \mu_j^{t+1})(x_i - \mu_j^{t+1})^T}{\sum_i p_{ij}^{t+1}}. \quad (4)$$
4. Set $t \leftarrow t + 1$.
5. Stop if a convergence criterion is met; otherwise return to step 2.
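The E- and M-steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function name `em_gmm`, the optional `mu0` initializer, the fixed iteration count in place of a convergence test, and the small covariance regularizer are all assumptions added for the sketch.

```python
import numpy as np

def em_gmm(X, c, n_iter=50, mu0=None, seed=0):
    """EM for a Gaussian mixture: alternate the E-step (posteriors p_ij)
    and the M-step (updates of pi, mu, Sigma), as in steps 2-3 above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = (np.array(mu0, dtype=float) if mu0 is not None
          else X[rng.choice(n, c, replace=False)].astype(float))
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    pi = np.full(c, 1.0 / c)
    for _ in range(n_iter):
        # E-step: p_ij proportional to pi_j N(x_i | mu_j, Sigma_j)
        p = np.empty((n, c))
        for j in range(c):
            diff = X - mu[j]
            inv = np.linalg.inv(Sigma[j])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[j]))
            p[:, j] = pi[j] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        p /= p.sum(axis=1, keepdims=True)
        # M-step: posterior-weighted updates of pi, mu, Sigma
        for j in range(c):
            w = p[:, j]
            pi[j] = w.mean()
            mu[j] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - mu[j]
            Sigma[j] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(d)
    return pi, mu, Sigma, p
```

On two well-separated Gaussian blobs, the returned means land close to the true cluster centres and the mixing weights sum to one.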
How Many Components?
Different information-theoretic model selection criteria:
- AIC
- BIC
- etc.
Model selection criteria evaluate the model fit while penalizing for the number of parameters in the model.
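The trade-off between fit and parameter count can be made concrete with the standard formulas AIC = -2 ln L + 2k and BIC = -2 ln L + k ln n (k parameters, n samples). The helper names below are illustrative:

```python
import numpy as np

def aic(log_likelihood, n_params):
    """AIC = -2 ln L + 2k: constant penalty of 2 per parameter."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """BIC = -2 ln L + k ln n: lower is better; the ln n factor
    penalizes extra mixture components harder than AIC does."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

# A model with slightly better fit but many more parameters can lose on BIC:
print(bic(-100.0, 5, 200) < bic(-98.0, 15, 200))  # True
```

In model selection for mixtures, one would fit models with c = 1, 2, 3, ... components, evaluate the criterion for each, and keep the c with the smallest value.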
Other Difficulties
- The label-switching problem
- Local optima of the likelihood (can be mitigated by a good initialization or a high-entropy initialization)
- A maximum likelihood solution might not exist
Sum-of-Squares Criteria
Given a set of n data samples, partition the data into c clusters so that a clustering criterion is optimized. The simplest is the sum-of-squares (or k-means) criterion:
$$J(D_1, \ldots, D_c) = \sum_{i=1}^{c} \sum_{x \in D_i} \| x - \mu_i \|^2. \quad (5)$$
There are various related criteria based on scatter matrices. The criterion produces sphere-like clusters.
K-Means Algorithm
Initialize $\mu_1(0), \ldots, \mu_c(0)$; set $t \leftarrow 0$.
repeat
  Classify each $x_i$, $i = 1, \ldots, n$, to the class $D_j(t)$ whose mean vector $\mu_j(t)$ is nearest to $x_i$.
  for k = 1 to c do
    update the mean vector: $\mu_k(t+1) = \frac{1}{|D_k(t)|} \sum_{x \in D_k(t)} x$
  end for
  Set $t \leftarrow t + 1$
until the clustering did not change
Return $D_1(t-1), \ldots, D_c(t-1)$.
(K-means demo)
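The loop above translates almost line for line into NumPy. A minimal sketch, assuming the caller supplies the initial means `mu0` (initialization strategy is left open, as on the slide):

```python
import numpy as np

def kmeans(X, c, mu0, n_iter=100):
    """The k-means loop above: classify every point to the nearest mean,
    recompute each mean as the average of its cluster, and stop when the
    assignment no longer changes."""
    mu = np.array(mu0, dtype=float)
    labels = -np.ones(len(X), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new = dist.argmin(axis=1)            # nearest-mean classification
        if np.array_equal(new, labels):      # clustering did not change
            break
        labels = new
        for k in range(c):
            if np.any(labels == k):          # keep the old mean if a cluster empties
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu
```

Each iteration can only decrease the sum-of-squares criterion (5), so the loop terminates, though possibly at a local optimum that depends on `mu0`.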
K-Means Properties
[Figure: examples of problematic clustering tasks for k-means, comparing the true labels (TRUE) with the k-means and EM results.]
Fuzzy K-Means
Fuzzy k-means algorithm:
Initialize the memberships $y_{ij}$; set $t \leftarrow 0$.
repeat
  for j = 1 to k do
    compute the mean vector $m_j = \frac{\sum_{i=1}^{n} y_{ij}^r x_i}{\sum_{i=1}^{n} y_{ij}^r}$
  end for
  for j = 1 to k, i = 1 to n do
    compute the distances $d_{ij} = \| x_i - m_j \|$
  end for
  for j = 1 to k, i = 1 to n do
    compute the membership function $y_{ij} = \frac{1}{\sum_{c=1}^{k} (d_{ij}/d_{ic})^{2/(r-1)}}$
  end for
  Set $t \leftarrow t + 1$
until the clustering did not change
Return the membership values $y_{ij}$.
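The same updates in vectorized form. A sketch under stated assumptions: random initial memberships, a fixed iteration count instead of the change test, and a small epsilon on the distances to avoid division by zero when a point coincides with a mean.

```python
import numpy as np

def fuzzy_kmeans(X, k, r=2.0, n_iter=50, seed=0):
    """Fuzzy k-means as above: soft memberships y_ij in [0, 1] with
    fuzzifier r > 1; means are membership-weighted averages and
    y_ij = 1 / sum_c (d_ij / d_ic)^(2/(r-1))."""
    rng = np.random.default_rng(seed)
    Y = rng.random((len(X), k))
    Y /= Y.sum(axis=1, keepdims=True)                 # rows sum to one
    for _ in range(n_iter):
        W = Y ** r
        M = (W.T @ X) / W.sum(axis=0)[:, None]        # weighted cluster means
        D = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + 1e-12
        # membership update: element [i, j, c] below is d_ij / d_ic
        Y = 1.0 / ((D[:, :, None] / D[:, None, :]) ** (2.0 / (r - 1))).sum(axis=2)
    return Y, M
```

Taking `Y.argmax(axis=1)` hardens the soft memberships into an ordinary partition; as r approaches 1 the memberships themselves approach hard k-means assignments.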
Self-Organizing Feature Maps
The aim is to represent high-dimensional data as a 1-D or 2-D array of units that captures the structure of the original data.
Spectral Clustering 1 / 2
How to cluster? Spectral clustering (normalized cuts, http://www.cis.upenn.edu/~jshi/software/demo1.html)
Spectral Clustering 2 / 2
Idea/motivation:
- Represent the data as a graph where the nodes are the data points and the edge weights are inversely proportional to the distances between the data points.
- Graphs do not have to be complete, i.e., every node pair does not have to be connected by an edge.
- A clustering is generated by making cuts to the graph (note that graph cuts in image segmentation are based on a similar idea, but the graph construction is different).
- Optimal graph cuts are often impossible to compute exactly (NP-complete problems), so spectral clustering utilizes approximations based on linear algebra.
- There are other interpretations/motivations for spectral clustering.
Graph Theory 1 / 2
[Figure: graph-theory preliminaries.]
Graph Theory 2 / 2
[Figure: graph-theory preliminaries, continued.]
Selecting Edge Adjacencies
For example, the elements of the adjacency matrix A can be set as
$$a_{ij} = \exp(-\| x_i - x_j \|^2 / \sigma)$$
if $x_i$ is one of the k nearest neighbours of $x_j$, and $a_{ij} = 0$ otherwise.
The Graph Laplacian
The nonsymmetric (Ncut) weighted graph Laplacian is
$$L = I - D^{-1} A,$$
where A is the adjacency matrix and D is the diagonal matrix with the entries $d_{ii} = \sum_{j=1}^{n} a_{ij}$ on the diagonal.
Clustering Algorithm (k Clusters)
1. Form the graph adjacency matrix A from the data.
2. Compute the Laplacian.
3. Solve the generalized eigenvalue problem $(D - A)v = \lambda D v$ (equivalently, $Lv = \lambda v$ with $L = I - D^{-1}A$) and select the eigenvectors $v_1, \ldots, v_k$ corresponding to the k smallest eigenvalues. Datapoint $x_1$ is mapped to $\tilde{x}_1 = [v_{11}, \ldots, v_{k1}]^T$, and so on.
4. Perform k-means clustering of the mapped datapoints $\tilde{x}_i$.
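The steps above can be sketched end to end in NumPy. Assumptions made for this sketch: a fully connected Gaussian-kernel adjacency (simplifying the k-nearest-neighbour rule of the earlier slide), the generalized problem solved via the equivalent symmetric form $(I - D^{-1/2} A D^{-1/2})u = \lambda u$ with $v = D^{-1/2}u$, and a plain k-means loop with deterministic farthest-first initialization instead of a library call.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0):
    """Sketch of the spectral clustering algorithm above: build a
    Gaussian-kernel adjacency, embed the points via the eigenvectors of
    the normalized Laplacian, then run k-means on the embedding."""
    n = len(X)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.exp(-d2 / sigma)                      # Gaussian-kernel weights
    np.fill_diagonal(A, 0.0)
    dis = 1.0 / np.sqrt(A.sum(axis=1))           # D^{-1/2} as a vector
    Lsym = np.eye(n) - dis[:, None] * A * dis[None, :]
    w, U = np.linalg.eigh(Lsym)                  # eigenvalues ascending
    V = dis[:, None] * U[:, :k]                  # k smallest -> embedding
    # k-means on the embedded points, farthest-first initialization
    mu = V[[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(V[:, None] - mu[None], axis=2).min(axis=1)
        mu = np.vstack([mu, V[d.argmax()]])
    labels = np.zeros(n, dtype=int)
    for _ in range(20):
        labels = np.linalg.norm(V[:, None] - mu[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                mu[j] = V[labels == j].mean(axis=0)
    return labels
```

For data forming two well-separated groups, the embedding is nearly piecewise constant per group, so the final k-means step separates them trivially.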
Cluster Validity
Evaluation of the clustering: is the cluster structure a property of the data (as it should be), or imposed by a particular clustering algorithm (as it should not be)? This is very difficult in high dimensions.
Different criteria:
1 Internal
2 External
3 Relative
Based on these criteria, we can statistically test different hypotheses on cluster validity.
Evaluating Partitions: Rand Index
[Definition figure from Wikipedia.]
Evaluating Partitions: Adjusted Rand Index
[Definition figure from Wikipedia.] Application example: J.-P. Kauppi, I.P. Jaaskelainen, M. Sams, and J. Tohka, "Clustering Inter-Subject Correlation Matrices in Functional Magnetic Resonance Imaging," IEEE-ITAB 2010.
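Both indices compare two partitions of the same points. The Rand index is the fraction of point pairs on which the partitions agree; the adjusted Rand index corrects it for chance agreement via the contingency table. A small sketch (function names are illustrative; scikit-learn's `adjusted_rand_score` does the same):

```python
import numpy as np
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs treated consistently by both labelings:
    same cluster in both, or different clusters in both."""
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(len(a)), 2))
    return agree / (len(a) * (len(a) - 1) / 2)

def adjusted_rand_index(a, b):
    """Rand index corrected for chance: about 0 in expectation for
    random labelings, 1 for identical partitions."""
    n = len(a)
    # contingency table between the two partitions
    C = np.array([[np.sum((a == x) & (b == y)) for y in np.unique(b)]
                  for x in np.unique(a)])
    comb2 = lambda m: m * (m - 1) / 2
    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

a = np.array([0, 0, 1, 1])
b = np.array([1, 1, 0, 0])  # the same partition with labels swapped
print(rand_index(a, b), adjusted_rand_index(a, b))  # 1.0 1.0
```

Both indices are invariant to relabeling the clusters, which is exactly what cluster evaluation needs, since cluster labels are arbitrary.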
Other Unsupervised Schemes
Various other unsupervised learning tasks exist in addition to clustering; we give just one example.
PageRank 1 / 2
We have N web pages and wish to rank them in terms of importance. The Google PageRank algorithm considers a webpage important if many other webpages point to it. The linking webpages that point to a given page are not treated equally: the algorithm also takes into account both the importance (PageRank) of the linking pages and the number of outgoing links they have.
PageRank 2 / 2
Let $L_{ij} = 1$ if page j points to page i and 0 otherwise, and let $c_j = \sum_{i=1}^{N} L_{ij}$ be the number of outlinks of page j. The PageRanks $p_i$ are defined recursively via
$$p_i = (1 - d) + d \sum_{j=1}^{N} (L_{ij}/c_j) \, p_j,$$
where d = 0.85, or in matrix form
$$p = (1 - d)\mathbf{1} + d \, L \, \mathrm{diag}(c)^{-1} p.$$
It can be shown that after proper normalization we get $p = Ap$, where A has one as its largest eigenvalue. Again an eigenvalue problem, this time one solvable by the power method.
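The recursion above is a fixed-point iteration, so it can be run directly as a power-method-style loop. A minimal sketch, assuming every page has at least one outlink (real implementations handle dangling pages separately); the function name `pagerank` is illustrative:

```python
import numpy as np

def pagerank(L, d=0.85, tol=1e-10):
    """Iterate p <- (1-d)*1 + d * L * diag(c)^{-1} * p to a fixed point,
    where L[i, j] = 1 if page j links to page i and c_j is the number
    of outlinks of page j (assumed nonzero here)."""
    c = L.sum(axis=0)            # outlink counts, one per column
    M = L / c                    # column j scaled by 1/c_j
    p = np.ones(L.shape[0])
    while True:
        p_new = (1 - d) + d * M @ p
        if np.linalg.norm(p_new - p) < tol:
            return p_new
        p = p_new

# Three pages: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
L = np.array([[0, 0, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
p = pagerank(L)
print(p.argmax())  # 1: page 1 receives the most link weight
```

The iteration converges geometrically because the update is a contraction with factor d < 1, which is the same reason the power method converges for the normalized matrix A of the slide.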
Summary
1 Clustering is partitioning (or classifying) a given data set directly, without labeled training data
2 The missing training data is replaced by a user-defined structure imposed on the pattern space
3 Various approaches, e.g. dissimilarity measures, mixture models, spectral methods
4 Cluster result evaluation: measures for cluster validity