Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April 2015

Transcription:

Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 & 7 April 2015

So far in the course
Supervised learning: learning with a teacher. You had training data consisting of (feature, label) pairs, and the goal was to learn a mapping from features to labels.
Unsupervised learning: learning without a teacher. Only features, no labels.
Why is unsupervised learning useful?
- Discover hidden structure in the data: clustering (today's topic)
- Visualization: dimensionality reduction; lower-dimensional features might also help learning

Clustering. Basic idea: group together similar instances. Example: 2D points.

Clustering. Basic idea: group together similar instances. Example: 2D points. What could "similar" mean? One option: small squared Euclidean distance, dist(x, y) = ||x - y||_2^2. Clustering results depend crucially on the measure of similarity (or distance) between the points to be clustered.

Clustering algorithms
Simple clustering: organize elements into k groups
- K-means
- Mean shift
- Spectral clustering
Hierarchical clustering: organize elements into a hierarchy
- Bottom up: agglomerative
- Top down: divisive

Clustering examples
- Image segmentation: break up the image into similar regions (image credit: Berkeley segmentation benchmark)
- Clustering news articles
- Clustering queries
- Clustering people by space and time (image credit: Pilho Kim)
- Clustering species (phylogeny), e.g. the phylogeny of canid species (dogs, wolves, foxes, jackals, etc.) [K. Lindblad-Toh, Nature 2005]

Clustering using k-means. Given (x_1, x_2, ..., x_n), partition the n observations into k (≤ n) sets S = {S_1, S_2, ..., S_k} so as to minimize the within-cluster sum of squared distances. The objective is to minimize

arg min_S Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x - μ_i||^2

where μ_i is the center (mean) of cluster S_i.
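To make the objective concrete, here is a minimal NumPy sketch (illustrative, not course code; X, centers, and labels are hypothetical names) that evaluates the within-cluster sum of squared distances for a given assignment:

    import numpy as np

    def kmeans_objective(X, centers, labels):
        # X: (n, d) data, centers: (k, d) cluster means, labels: (n,) cluster index per point
        diffs = X - centers[labels]        # each point minus its assigned center
        return float(np.sum(diffs ** 2))   # sum over clusters and points of ||x - mu_i||^2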

Lloyd's algorithm for k-means
Initialize the k centers by picking k points randomly among all the points.
Repeat till convergence (or a maximum number of iterations):
- Assignment step: assign each point to the nearest center.
- Update step: set each center μ_i to the mean of its group S_i.
Both steps decrease the objective Σ_{i=1}^{k} Σ_{x ∈ S_i} ||x - μ_i||^2.
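A minimal sketch of Lloyd's algorithm in NumPy (an illustration under the assumption that no cluster becomes empty, not the course implementation):

    import numpy as np

    def lloyd_kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(max_iters):
            # Assignment step: nearest center for each point
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step: mean of each group (assumes every cluster keeps at least one point)
            new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centers, centers):   # converged: partitions stopped changing
                break
            centers = new_centers
        return centers, labels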

k-means in action: http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/

Properties of Lloyd's algorithm
- Guaranteed to converge in a finite number of iterations: the objective decreases monotonically over time and reaches a local minimum when the partitions stop changing; since there are finitely many partitions, the algorithm must converge.
- Running time per iteration: O(NKD) for the assignment step, O(ND) for computing the cluster means.
Issues with the algorithm:
- The worst-case running time is super-polynomial in the input size.
- No guarantees about global optimality; optimal clustering is NP-hard even for 2 clusters [Aloise et al., 09].
- Sensitive to initialization, which motivates k-means++.

k-means++ algorithm [Arthur and Vassilvitskii 07]
A way to pick good initial centers. Intuition: spread out the k initial cluster centers. The algorithm proceeds normally once the centers are initialized.
k-means++ initialization:
1. Choose one center uniformly at random among all the points.
2. For each point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centers have been chosen.
The approximation quality is O(log k) in expectation. http://en.wikipedia.org/wiki/k-means%2b%2b
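A minimal sketch of the k-means++ initialization (illustrative; the returned centers would seed an ordinary k-means run such as the lloyd_kmeans sketch above):

    import numpy as np

    def kmeanspp_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]   # step 1: first center uniformly at random
        for _ in range(k - 1):
            # step 2: squared distance from each point to the nearest chosen center
            d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # step 3: sample a new center with probability proportional to D(x)^2
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)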

k-means for image segmentation. Grouping pixels based on intensity similarity; feature space: intensity value (1D). (Figures show segmentations with K = 2 and K = 3.)
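A sketch of this idea with scikit-learn (assumes scikit-learn is available; img is a hypothetical grayscale image array, not data from the slides):

    import numpy as np
    from sklearn.cluster import KMeans

    def segment_by_intensity(img, k=3):
        # img: (H, W) grayscale image; feature space is the 1D intensity value
        features = img.reshape(-1, 1)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
        return labels.reshape(img.shape)   # per-pixel cluster label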

Clustering using density estimation. One issue with k-means is that it is sometimes hard to pick k. The mean shift algorithm instead seeks modes (local maxima) of the density in the feature space, which automatically determines the number of clusters. Kernel density estimator:

K(x) = (1/Z) Σ_i exp(-||x - x_i||^2 / h)

A small h implies more modes (a bumpier distribution).
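A minimal sketch of evaluating this kernel density estimate at a query point (the normalizer Z is folded into a 1/n factor here, which only rescales the density and leaves the modes unchanged):

    import numpy as np

    def kde(x, X, h):
        # X: (n, d) data points, x: (d,) query point, h: bandwidth
        sq_dists = ((X - x) ** 2).sum(axis=1)
        return np.exp(-sq_dists / h).sum() / len(X)   # proportional to the density at x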

Mean shift algorithm. Mean shift procedure: for each point, repeat till convergence:
- Compute the mean shift vector m(x).
- Translate the kernel window by m(x).
This simply follows the gradient of the density. With kernel profile g (e.g. g(u) = exp(-u) for the Gaussian kernel above),

m(x) = [ Σ_{i=1}^{n} x_i g(||x - x_i||^2 / h) ] / [ Σ_{i=1}^{n} g(||x - x_i||^2 / h) ] - x

Slide by Y. Ukrainitz & B. Sarel
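A minimal sketch of one mean shift trajectory with the Gaussian kernel g(u) = exp(-u) applied to u = ||x - x_i||^2 / h (illustrative names; see the formula above):

    import numpy as np

    def mean_shift_point(x, X, h, n_iters=100, tol=1e-5):
        # Moves a single query point x uphill on the kernel density estimate over data X.
        for _ in range(n_iters):
            w = np.exp(-((X - x) ** 2).sum(axis=1) / h)      # kernel weights g(||x - x_i||^2 / h)
            x_new = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted mean; shift m(x) = x_new - x
            if np.linalg.norm(x_new - x) < tol:              # converged near a mode
                break
            x = x_new
        return x   # approximate mode; points reaching the same mode form one cluster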

Mean shift illustration (sequence of slides): the search window repeatedly moves by the mean shift vector to the center of mass of the points inside it, until it converges on a mode. Slides by Y. Ukrainitz & B. Sarel

Mean shift clustering. Cluster together all data points in the attraction basin of a mode. The attraction basin is the region from which all trajectories lead to the same mode; attraction basins correspond to clusters. Slide by Y. Ukrainitz & B. Sarel

Mean shift for image segmentation
- Feature: L*u*v* color values
- Initialize windows at individual feature points
- Perform mean shift for each window until convergence
- Merge windows that end up near the same peak or mode
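A rough sketch of this pipeline using scikit-learn's MeanShift and scikit-image's color conversion (both assumed available; the slides' per-window procedure is approximated here by clustering all pixel features directly, which is slow for large images):

    import numpy as np
    from sklearn.cluster import MeanShift
    from skimage.color import rgb2luv

    def mean_shift_segment(img, bandwidth=8.0):
        # img: (H, W, 3) RGB image with values in [0, 1]; features are L*u*v* color values
        features = rgb2luv(img).reshape(-1, 3)
        labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(features)
        return labels.reshape(img.shape[:2])   # pixels ending near the same mode share a label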

Mean shift clustering results: http://www.caip.rutgers.edu/~comanici/mspami/mspamiresults.html

Mean shift discussion
Pros:
- Does not assume a shape for the clusters
- A single parameter choice (window size)
- Generic technique
- Finds multiple modes
Cons:
- Selection of the window size
- Rather expensive: O(DN^2) per iteration
- Does not work well for high-dimensional features
(Slide credit: Kristen Grauman)

Spectral clustering [Shi & Malik 00; Ng, Jordan, Weiss NIPS 01]. Figures from Ng, Jordan, Weiss NIPS 01.

Spectral clustering. Group points based on the links in a graph. How do we create the graph? Put weights on the edges based on the similarity between the points; a common choice is the Gaussian kernel:

W(i, j) = exp(-||x_i - x_j||^2 / (2σ^2))

One could create a fully connected graph, or a k-nearest-neighbor graph (each node is connected only to its k nearest neighbors). (Slide credit: Alan Fern)
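A minimal sketch of the fully connected affinity graph under the Gaussian kernel (σ is a free parameter controlling the similarity scale):

    import numpy as np

    def gaussian_affinity(X, sigma):
        # Pairwise squared distances between all rows of X
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / (2 * sigma ** 2))   # W(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
        np.fill_diagonal(W, 0)               # no self-edges
        return W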

Graph cut. Consider a partition of the graph into two parts A and B. Cut(A, B) is the total weight of the edges that connect the two groups:

Cut(A, B) = Σ_{i ∈ A, j ∈ B} W(i, j)   (= 0.3 in the example figure)

An intuitive goal is to find the partition that minimizes the cut; min-cuts in graphs can be computed in polynomial time.

Problem with min-cut. The weight of a cut is proportional to the number of edges in the cut, so min-cut tends to produce small, isolated components [Shi & Malik, 2000 PAMI]. We would like a balanced cut.

Graphs as matrices. Let W(i, j) denote the matrix of edge weights. The degree of node i in the graph is d(i) = Σ_j W(i, j). The volume of a set A is defined as Vol(A) = Σ_{i ∈ A} d(i).

Normalized cut. Intuition: consider the connectivity between the groups relative to the volume of each group:

NCut(A, B) = Cut(A, B)/Vol(A) + Cut(A, B)/Vol(B) = Cut(A, B) · (Vol(A) + Vol(B)) / (Vol(A) Vol(B))

The second factor is minimized when Vol(A) = Vol(B), which encourages a balanced cut. Unfortunately, minimizing the normalized cut is NP-hard, even for planar graphs [Shi & Malik, 00].

Solving normalized cuts. We formulate an optimization problem:
- Let W be the similarity matrix.
- Let D be the diagonal matrix with D(i, i) = d(i), the degree of node i.
- Let x be a vector in {1, -1}^N with x(i) = 1 iff i ∈ A.
- The matrix (D - W) is called the Laplacian of the graph.
With some simplification, the problem of minimizing the normalized cut can be written as:

min_x NCut(x) = min_y [ y^T (D - W) y ] / [ y^T D y ],  subject to y^T D 1 = 0 and y(i) ∈ {1, -b}

Solving normalized cuts. Normalized cuts objective:

min_x NCut(x) = min_y [ y^T (D - W) y ] / [ y^T D y ],  subject to y^T D 1 = 0, y(i) ∈ {1, -b}

Relax the integer constraint on y; the problem becomes

min_y y^T (D - W) y,  subject to y^T D y = 1, y^T D 1 = 0

which is the same as the generalized eigenvalue problem (D - W) y = λ D y. Note that (D - W) 1 = 0, so the first eigenvector is y_1 = 1 with eigenvalue 0. The eigenvector corresponding to the second smallest eigenvalue is the solution to the relaxed problem.
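A minimal sketch of the relaxed solution with SciPy's generalized symmetric eigensolver, thresholding the second eigenvector at its median to get a bipartition (assumes a connected graph so that all degrees are positive and D is positive definite; W could be the Gaussian affinity sketched earlier):

    import numpy as np
    from scipy.linalg import eigh

    def ncut_bipartition(W):
        d = W.sum(axis=1)
        D = np.diag(d)
        L = D - W                         # graph Laplacian
        eigvals, eigvecs = eigh(L, D)     # solves (D - W) y = lambda D y, eigenvalues ascending
        y = eigvecs[:, 1]                 # eigenvector of the second smallest eigenvalue
        return y > np.median(y)           # boolean split of the nodes into A and B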

Spectral clustering example. Data: Gaussian-weighted edges, each point connected to its 3 nearest neighbors. The figure shows the components of the eigenvector corresponding to the second smallest eigenvalue.

Creating partitions from eigenvectors
The eigenvector is real valued, so to obtain a split you may threshold it. How to pick the threshold?
- Pick the median value
- Choose a threshold that minimizes the normalized cut objective
How to create multiple partitions?
- Recursively split each partition
- Compute multiple eigenvectors and cluster them using k-means
Example: multiple eigenvectors of an image and their gradients. http://ttic.uchicago.edu/~mmaire/papers/pdf/amfm_tpami2011.pdf

Hierarchical clustering

Hierarchical clustering. Organize elements into a hierarchy. Two kinds of methods:
- Agglomerative: a bottom-up approach where elements start as individual clusters and clusters are merged as one moves up the hierarchy.
- Divisive: a top-down approach where elements start as a single cluster and clusters are split as one moves down the hierarchy.

Agglomerative clustering
First merge very similar instances, then incrementally build larger clusters out of smaller clusters.
Algorithm (a sketch follows below):
- Maintain a set of clusters; initially, each instance is in its own cluster.
- Repeat: pick the two closest clusters and merge them into a new cluster.
- Stop when there's only one cluster left.
Produces not one clustering but a family of clusterings, represented by a dendrogram.
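A minimal sketch using SciPy's agglomerative clustering routines (illustrative; the 'single', 'complete', and 'average' linkage methods correspond to the options on the next slide):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def agglomerative(X, n_clusters, method="average"):
        # Build the full merge tree (dendrogram) bottom-up from the observations in X ...
        Z = linkage(X, method=method)
        # ... then cut it to obtain a flat clustering with the desired number of clusters.
        return fcluster(Z, t=n_clusters, criterion="maxclust")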

Agglomerative clustering. How should we define "closest" for clusters with multiple elements? Many options:
- Closest pair: single-link clustering
- Farthest pair: complete-link clustering
- Average of all pairs: average-link clustering

Agglomerative clustering. Different choices create different clustering behaviors.

Summary
Clustering is an example of unsupervised learning; it produces partitions or a hierarchy.
Several partitioning algorithms:
- k-means: simple, efficient, and often works in practice; k-means++ for better initialization
- mean shift: finds modes of the density; slow, but suited to problems with an unknown number of clusters of varying shapes and sizes
- spectral clustering: clustering as graph partitioning; solve (D - W)y = λDy, followed by k-means
Hierarchical clustering methods:
- Agglomerative or divisive
- single-link, complete-link, and average-link

Slide credits
Slides adapted from David Sontag, Luke Zettlemoyer, Vibhav Gogate, Carlos Guestrin, Andrew Moore, Dan Klein, James Hays, Alan Fern, and Tommi Jaakkola.
Many images are from the Berkeley segmentation benchmark: http://www.eecs.berkeley.edu/research/projects/cs/vision/bsds
Normalized cuts image segmentation: http://www.timotheecour.com/research.html