SGN-41006 (4 cr) Chapter 11: Clustering
Jussi Tohka & Jari Niemi
Department of Signal Processing, Tampere University of Technology
February 25, 2014

Contents of This Lecture
1 Hierarchical Clustering
2 Quick Partitions
3 Mixture Models
4 Sum-of-Squares
5 Spectral Clustering
6 Cluster Validity
7 Other Unsupervised Schemes

Material
Chapter 11 in WebCop:2011 and Section 14.10 in HasTibFri:2009.

What Should You Already Know?
Basics of hierarchical clustering, k-means, mixture models, and SOM, depending on the basic course you have taken.

Clustering
Given a set of points, divide them into c clusters based on their similarity. The number of clusters c may or may not be known. Different definitions of similarity give rise to different clustering algorithms.
More formally, given a set of points D = {x_1, ..., x_n}, the task is to place each of these into one of c classes, i.e., to find c sets D_1, ..., D_c so that
\bigcup_{i=1}^{c} D_i = D  and  D_i \cap D_j = \emptyset  for all  i \neq j.

Hierarchical Clustering: Hierarchical Methods
Derive a clustering from a given dissimilarity matrix.
A means of summarizing data structure via dendrograms.
Divisive or agglomerative (the latter is much more common).
Agglomerative clustering works by repeatedly merging the two closest clusters, starting from n singleton clusters.

Hierarchical Clustering: Hierarchical Methods 1 / 2
Different definitions of "closest" yield different algorithms:
Single-link (nearest neighbour)
Complete-link (farthest neighbour)
Average-link
Ward's method

Hierarchical Clustering: Hierarchical Methods 2 / 2
Single-link seeks isolated clusters but suffers from a chaining effect that is (usually) undesirable.
Complete-link, average-link, and Ward's method concentrate on internal cohesion, producing compact and often spherical clusters.
A mathematical definition of clustering quality is difficult.
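To see how the choice of linkage changes the result, here is a minimal sketch using SciPy's agglomerative clustering routines; the synthetic two-group data and the cut into two clusters are assumptions for illustration, not from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic data: two elongated groups (assumed toy example).
X = np.vstack([rng.normal([0, 0], [3.0, 0.3], size=(50, 2)),
               rng.normal([0, 4], [3.0, 0.3], size=(50, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes differ by linkage
```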

Quick Partitions
For initial partitions of the data:
Random k selection
Variable division
Leader algorithm

Mixture Models
A mixture density:
p(x) = \sum_{j=1}^{c} \pi_j p(x | \theta_j).   (1)
The priors \pi_j in mixture densities are called mixing parameters, and the class-conditional densities are called component densities. Most commonly the component densities are normal.

Mixture Models: EM Algorithm for Gaussian Mixture Models
1 Initialize \mu_j^0, \Sigma_j^0, \pi_j^0 for j = 1, ..., c, and set t = 0.
2 (E-step) Compute the posterior probabilities
p_{ij}^{t+1} = \frac{\pi_j^t \, p_{normal}(x_i | \mu_j^t, \Sigma_j^t)}{\sum_{k=1}^{c} \pi_k^t \, p_{normal}(x_i | \mu_k^t, \Sigma_k^t)}.
3 (M-step) Update the parameter values
\pi_j^{t+1} = \frac{1}{n} \sum_i p_{ij}^{t+1},   (2)
\mu_j^{t+1} = \frac{\sum_i p_{ij}^{t+1} x_i}{\sum_i p_{ij}^{t+1}},   (3)
\Sigma_j^{t+1} = \frac{\sum_i p_{ij}^{t+1} (x_i - \mu_j^{t+1})(x_i - \mu_j^{t+1})^T}{\sum_i p_{ij}^{t+1}}.   (4)
4 Set t = t + 1.
5 Stop if a convergence criterion is met; otherwise return to step 2.
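To make the E- and M-step updates concrete, here is a minimal NumPy/SciPy sketch of the algorithm; the initialization from randomly chosen data points, the small regularization added to each covariance, and the log-likelihood stopping rule are my own assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, c, n_iter=100, tol=1e-6, seed=0):
    """Fit a c-component Gaussian mixture to X (n x d) with plain EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, c, replace=False)]                   # initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(c)])
    pi = np.full(c, 1.0 / c)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior p_ij of component j given point x_i
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(c)])
        p = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and covariances
        Nj = p.sum(axis=0)
        pi = Nj / n
        mu = (p.T @ X) / Nj[:, None]
        for j in range(c):
            diff = X - mu[j]
            Sigma[j] = (p[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        # Log-likelihood under the previous parameters, used as stopping criterion
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, p
```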

Mixture Models: How Many Components?
Different information-theoretic model selection criteria: AIC, BIC, etc.
Model selection criteria evaluate the model fit while penalizing the number of parameters in the model.
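As one way to apply such a criterion in practice, the sketch below selects the number of components by BIC using scikit-learn's GaussianMixture; the candidate range of component counts is an arbitrary assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_components_bic(X, c_max=10, seed=0):
    """Return the component count in 1..c_max with the lowest BIC on X."""
    bics = []
    for c in range(1, c_max + 1):
        gmm = GaussianMixture(n_components=c, random_state=seed).fit(X)
        bics.append(gmm.bic(X))   # lower BIC = better fit/complexity trade-off
    return int(np.argmin(bics)) + 1, bics
```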

Mixture Models: Other Difficulties
Label switching problem.
Local minima (their effect can be reduced by a good initialization or a high-entropy initialization).
The maximum likelihood solution might not exist.

Sum-of-Squares Criteria
Given a set of n data samples, partition the data into c clusters so that a clustering criterion is optimized. The simplest such criterion is the sum-of-squares (or k-means) criterion:
J(D_1, ..., D_c) = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2.   (5)
There are various related criteria based on scatter matrices. The criterion favours sphere-like clusters.
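As a small worked example, criterion (5) can be evaluated directly for a given partition; the toy data and labels below are assumptions for illustration.

```python
import numpy as np

def sum_of_squares(X, labels):
    """J(D_1, ..., D_c): sum over clusters of squared distances to the cluster mean."""
    return sum(np.square(X[labels == k] - X[labels == k].mean(axis=0)).sum()
               for k in np.unique(labels))

# Toy data with an assumed partition into two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(sum_of_squares(X, labels))   # 0.5 + 0.5 = 1.0
```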

Sum-of-Squares: K-Means Algorithm
Initialize \mu_1(0), ..., \mu_c(0), set t = 0
repeat
  Classify each x_i, i = 1, ..., n, into the cluster D_j(t) whose mean vector \mu_j(t) is nearest to x_i.
  for k = 1 to c do
    Update the mean vector \mu_k(t+1) = \frac{1}{|D_k(t)|} \sum_{x \in D_k(t)} x
  end for
  Set t = t + 1
until the clustering did not change
Return D_1(t-1), ..., D_c(t-1).
(K-means demo)
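A minimal NumPy sketch of the algorithm above; initialization from randomly chosen data points and keeping the old mean for an empty cluster are assumptions not specified on the slide.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Plain k-means: returns labels and cluster means for X (n x d)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), c, replace=False)].copy()   # initial means
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest mean in Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                          # clustering unchanged
        labels = new_labels
        # Update step: mean of each cluster (keep old mean if a cluster is empty)
        for k in range(c):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu
```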

Sum-of-Squares: K-Means Properties
[Figure: panels labeled TRUE, K-means, and EM showing examples of clustering tasks that are problematic for k-means.]

Sum-of-Squares: Fuzzy K-Means
Initialize the memberships y_{ij}, set t = 0
repeat
  for j = 1 to k do
    Compute the mean vectors m_j = \frac{\sum_{i=1}^{n} y_{ij}^r x_i}{\sum_{i=1}^{n} y_{ij}^r}
  end for
  for j = 1 to k, i = 1 to n do
    Compute the distances d_{ij} = \|x_i - m_j\|
  end for
  for j = 1 to k, i = 1 to n do
    Compute the membership function y_{ij} = \frac{1}{\sum_{c=1}^{k} (d_{ij}/d_{ic})^{2/(r-1)}}
  end for
  Set t = t + 1
until the clustering did not change
Return the membership values y_{ij}.
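A minimal NumPy sketch of these fuzzy updates, assuming fuzzifier r = 2, random initial memberships, a fixed iteration count, and a small epsilon to avoid division by zero; none of these choices come from the slide.

```python
import numpy as np

def fuzzy_kmeans(X, k, r=2.0, n_iter=100, seed=0, eps=1e-12):
    """Fuzzy k-means: returns an (n x k) membership matrix and the k means."""
    rng = np.random.default_rng(seed)
    Y = rng.random((len(X), k))
    Y /= Y.sum(axis=1, keepdims=True)                  # memberships sum to one per point
    for _ in range(n_iter):
        W = Y ** r
        M = (W.T @ X) / W.sum(axis=0)[:, None]         # weighted means m_j
        D = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps
        Y = D ** (-2.0 / (r - 1.0))                     # unnormalized memberships
        Y /= Y.sum(axis=1, keepdims=True)               # equals 1 / sum_c (d_ij/d_ic)^(2/(r-1))
    return Y, M
```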

Sum-of-Squares: Self-Organizing Feature Maps
The aim is to present high-dimensional data as a 1-D or 2-D array of numbers that captures the structure in the original data.

Spectral Clustering 1 / 2
How to cluster?
Spectral clustering (normalized cuts, http://www.cis.upenn.edu/~jshi/software/demo1.html)

Spectral Clustering 2 / 2
Idea/motivation: Represent the data as a graph where the nodes are the data points and the edge weights are (inversely) proportional to the distances between the data points.
The graph does not have to be complete, i.e., not every node pair has to be connected by an edge.
The clustering is generated by making cuts in the graph (note that graph cuts in image segmentation are based on a similar idea, but the graph construction is different).
The optimal graph cuts are often impossible to compute exactly (NP-complete problems), so spectral clustering utilizes approximations based on linear algebra.
There are other interpretations/motivations for spectral clustering.

Spectral Clustering: Graph Theory 1 / 2

Spectral Clustering: Graph Theory 2 / 2

Spectral Clustering: Selecting Edge Adjacencies
For example, the elements of the adjacency matrix A can be set to
a_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma)  if x_i is one of the k nearest neighbours of x_j, and a_{ij} = 0 otherwise.

Spectral Clustering: The Graph Laplacian
The nonsymmetric (Ncut) weighted graph Laplacian is
L = I - D^{-1} A,
where A is the adjacency matrix and D is the diagonal matrix with entries d_{ii} = \sum_{j=1}^{n} a_{ij} on the diagonal.

Spectral Clustering: Clustering Algorithm (k Clusters)
1 Form the graph adjacency matrix based on the data.
2 Compute the Laplacian.
3 Solve the generalized eigenvalue problem Lv = \lambda Dv and select the eigenvectors v_1, ..., v_k corresponding to the k smallest eigenvalues. Datapoint x_1 is mapped to \tilde{x}_1 = [v_{11}, ..., v_{k1}]^T, and so on.
4 Perform k-means clustering of the mapped datapoints \tilde{x}_i.
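A minimal NumPy/SciPy sketch of these four steps, using the Gaussian k-nearest-neighbour affinity from the edge-adjacency slide and scikit-learn's KMeans for the final step; symmetrizing the kNN graph and the parameter values are my own assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, n_neighbors=10, sigma=1.0, seed=0):
    """Cluster X (n x d) into k groups via the generalized eigenproblem Lv = lambda Dv."""
    n = len(X)
    # Step 1: adjacency with Gaussian weights on k-nearest-neighbour edges
    dist2 = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist2[i])[1:n_neighbors + 1]        # skip the point itself
        A[i, nn] = np.exp(-dist2[i, nn] / sigma)
    A = np.maximum(A, A.T)                                   # symmetrize (assumption)
    # Step 2: degree matrix and symmetric Laplacian L = D - A;
    # note (D - A)v = lambda D v is equivalent to (I - D^{-1}A)v = lambda v
    D = np.diag(A.sum(axis=1))
    L = D - A
    # Step 3: eigenvectors of the k smallest generalized eigenvalues
    vals, vecs = eigh(L, D)
    embedding = vecs[:, :k]                                  # rows are the mapped points
    # Step 4: k-means on the embedded points
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
```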

Cluster Validity
Evaluation of the clustering: is the cluster structure a property of the data (as it should be) or imposed by a particular clustering algorithm (as it should not be)?
Very, very difficult in high dimensions.
Different criteria:
1 Internal
2 External
3 Relative
Based on these criteria, we can statistically test different hypotheses on cluster validity.

Evaluating Partitions: Rand Index
[Definition of the Rand index, reproduced from Wikipedia.]

Evaluating Partitions: Adjusted Rand Index
[Definition of the adjusted Rand index, reproduced from Wikipedia.]
Application example: J.-P. Kauppi, I. P. Jääskeläinen, M. Sams, and J. Tohka. Clustering Inter-Subject Correlation Matrices in Functional Magnetic Resonance Imaging. IEEE-ITAB 2010.
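As a usage sketch, scikit-learn provides both indices for comparing a clustering against reference labels; the toy label vectors below are assumed for illustration.

```python
from sklearn.metrics import rand_score, adjusted_rand_score

# Toy example: reference partition vs. a clustering result (assumed labels).
labels_true = [0, 0, 0, 1, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 1, 2, 2]

print("Rand index:", rand_score(labels_true, labels_pred))
print("Adjusted Rand index:", adjusted_rand_score(labels_true, labels_pred))
# The adjusted index corrects for chance agreement: it is 0 in expectation
# for random labelings and 1 for a perfect match.
```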

Other Unsupervised Schemes
Various other unsupervised learning tasks exist in addition to clustering; we give just one example.

Other Unsupervised Schemes: PageRank 1 / 2
We have N web pages and wish to rank them in terms of importance. The Google PageRank algorithm considers a webpage to be important if many other webpages point to it. The linking webpages that point to a given page are not treated equally: the algorithm also takes into account both the importance (PageRank) of the linking pages and the number of outgoing links that they have.

Other Unsupervised Schemes: PageRank 2 / 2
Let L_{ij} = 1 if page j points to page i, and 0 otherwise. Let c_j = \sum_{i=1}^{N} L_{ij} (the number of outlinks of page j).
The PageRanks p_i are defined recursively via
p_i = (1 - d) + d \sum_{j=1}^{N} (L_{ij} / c_j) p_j,
where d = 0.85, or in matrix form
p = (1 - d) \mathbf{1} + d \, L \, \mathrm{diag}(c)^{-1} p.
It can be shown that after proper normalization we get p = A p, where A has one as its largest eigenvalue. Again an eigenvalue problem, this time one solvable by the power method.
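A minimal NumPy sketch that iterates the recursion above as a fixed-point/power-method scheme; the toy link matrix and the guard for pages without outlinks are my own assumptions.

```python
import numpy as np

def pagerank(L, d=0.85, n_iter=100, tol=1e-10):
    """Iterate p_i = (1 - d) + d * sum_j (L_ij / c_j) p_j until convergence."""
    N = L.shape[0]
    c = L.sum(axis=0)                       # outlink counts c_j = sum_i L_ij
    c = np.where(c == 0, 1, c)              # guard against dangling pages (assumption)
    M = L / c                               # column j scaled by 1 / c_j
    p = np.ones(N)
    for _ in range(n_iter):
        p_new = (1 - d) + d * (M @ p)
        if np.linalg.norm(p_new - p, 1) < tol:
            break
        p = p_new
    return p

# Toy link matrix: L[i, j] = 1 if page j points to page i (assumed example).
L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(L))
```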

Summary
1 Clustering is a partitioning (or classification) of a given data set directly, without labeled training data.
2 The missing training data is replaced by a user-defined structure imposed on the pattern space.
3 Various approaches exist, e.g., dissimilarity measures, mixture models, and spectral methods.
4 Cluster result evaluation: measures for cluster validity.