K-Means Clustering. Sargur Srihari

K-Means Clustering
Sargur Srihari
srihari@cedar.buffalo.edu

Topics in Mixture Models and EM
- Mixture models
- K-means clustering
- Mixtures of Gaussians
- Maximum likelihood
- EM for Gaussian mixtures
- EM algorithm
  - Gaussian mixture models motivate EM
  - Latent variable viewpoint
  - K-means seen as a non-probabilistic limit of EM applied to a mixture of Gaussians
  - EM in generality
- Bernoulli mixture models
- Infinite mixture models

K-means Clustering
- Given a data set {x_1, .., x_N} in D-dimensional Euclidean space
- Partition it into K clusters, where K is given
- One-of-K coding: indicator variables r_nk ∈ {0,1}, for k = 1, .., K, describe which of the K clusters data point x_n is assigned to

Distortion measure
- Sum of squared errors:

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2

- Goal is to find values for the {r_nk} and the {µ_k} so as to minimize J
- Can be done by an iterative procedure; each iteration has two steps: successive optimization w.r.t. r_nk and µ_k
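
The distortion can be evaluated directly; below is a minimal NumPy sketch (not part of the original slides), assuming X is an N x D data matrix, mu a K x D matrix of cluster means, and r an N x K one-of-K indicator matrix:

    import numpy as np

    def distortion(X, mu, r):
        # J = sum_n sum_k r_nk * ||x_n - mu_k||^2
        diffs = X[:, None, :] - mu[None, :, :]      # shape (N, K, D)
        sq_dists = np.sum(diffs ** 2, axis=-1)      # shape (N, K)
        return np.sum(r * sq_dists)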

Two Updating Stages
- First choose some initial values for the µ_k
- First phase: minimize J w.r.t. the r_nk, keeping the µ_k fixed
- Second phase: minimize J w.r.t. the µ_k, keeping the r_nk fixed
- The two stages correspond to the E (expectation) and M (maximization) steps of the EM algorithm
  - Expectation: what is the expected class?
  - Maximization: what is the MLE of the mean?

E: Determination of the Indicators r_nk
- Because

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2

  is a linear function of r_nk, this optimization is performed easily (closed-form solution)
- Terms involving different n are independent, so we can optimize for each n separately
- Choose r_nk to be 1 for whichever value of k gives the minimum value of ||x_n - µ_k||^2, thus

    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}

- Interpretation: assign x_n to the cluster whose mean is closest
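
Continuing the NumPy sketch above (illustrative names, not from the slides), the E step assigns each point to its nearest mean and encodes the result in one-of-K form:

    def e_step(X, mu):
        # r_nk = 1 if mu_k is the closest mean to x_n, else 0
        sq_dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=-1)  # (N, K)
        nearest = np.argmin(sq_dists, axis=1)
        r = np.zeros((X.shape[0], mu.shape[0]))
        r[np.arange(X.shape[0]), nearest] = 1.0
        return r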

M: Optimization of µ_k
- Hold the r_nk fixed
- The objective function

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2

  is a quadratic function of µ_k
- It is minimized by setting the derivative w.r.t. µ_k to zero:

    2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

- which is solved to give

    \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

- Interpretation: set µ_k to the mean of all data points x_n assigned to cluster k; the denominator \sum_n r_{nk} is equal to the number of points assigned to cluster k
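
The corresponding M step in the same sketch (the guard against empty clusters is a practical detail not discussed on the slide):

    def m_step(X, r):
        # mu_k = (sum_n r_nk x_n) / (sum_n r_nk), the mean of the points in cluster k
        counts = r.sum(axis=0)
        counts = np.maximum(counts, 1e-12)   # avoid division by zero for empty clusters
        return (r.T @ X) / counts[:, None]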

Termination of K-Means
- The two phases, re-assigning data points to clusters and re-computing the cluster means, are repeated until there is no further change in the assignments
- Since each phase reduces J, convergence is assured
- It may, however, converge to a local minimum of J
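
Putting the two phases together, a minimal batch k-means loop (an illustrative sketch reusing the e_step and m_step helpers above, initializing the means from a random subset of K data points and stopping when the assignments no longer change):

    def kmeans(X, K, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
        r = e_step(X, mu)
        for _ in range(max_iter):
            mu = m_step(X, r)
            r_new = e_step(X, mu)
            if np.array_equal(r_new, r):     # no change in assignments: converged
                break
            r = r_new
        return mu, r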

Illustration of K-means
- Old Faithful dataset; a single Gaussian is a poor fit, and we choose K = 2
- The data set is standardized so that each variable has zero mean and unit standard deviation
- Assigning each data point to the nearest cluster center is then equivalent to asking which side it lies on of the perpendicular bisector of the line joining the two cluster centers

K-means iterations
- [Figure: sequence of plots running from the initial choice of means (parameters) through alternating E and M steps to the final clusters and means]
- E step: the parameters are fixed and the distributions (assignments) are optimized
- M step: the distributions are fixed and the parameters are optimized

Cost Function after Each Iteration
- Plot of J after each iteration for the Old Faithful data
- A poor initial value was chosen for the cluster centers, so several steps are needed for convergence
- A better choice is to set the µ_k to a randomly chosen subset of K data points
- K-means is itself used to initialize the parameters of a Gaussian mixture model before applying EM
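
To produce a trace like the one plotted on this slide, J can be recorded after every update; a short sketch reusing the helpers above:

    def kmeans_trace(X, K, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
        r = e_step(X, mu)
        trace = [distortion(X, mu, r)]
        for _ in range(max_iter):
            mu = m_step(X, r)
            trace.append(distortion(X, mu, r))       # J after the M step
            r_new = e_step(X, mu)
            trace.append(distortion(X, mu, r_new))   # J after the next E step
            if np.array_equal(r_new, r):
                break
            r = r_new
        return trace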

Faster Implementation of K-means
- A direct implementation can be slow: in the E step, Euclidean distances are computed between every mean and every data point, i.e., ||x_n - µ_k||^2 is computed for n = 1, .., N and k = 1, .., K
- Faster implementations exist:
  - Precompute trees in which nearby points lie on the same sub-tree
  - Use the triangle inequality to avoid unnecessary distance calculations
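
One form of the triangle-inequality trick (in the spirit of Elkan's accelerated k-means, which the slide does not name): if ||µ_k - µ_j|| >= 2 ||x_n - µ_k||, then ||x_n - µ_j|| >= ||x_n - µ_k||, so the distance from x_n to µ_j never needs to be computed. A rough, illustrative sketch:

    def e_step_pruned(X, mu):
        # Returns the index of the nearest mean for each point, skipping distances
        # that the triangle inequality proves cannot beat the current best.
        K = mu.shape[0]
        center_dists = np.sqrt(np.sum((mu[:, None, :] - mu[None, :, :]) ** 2, axis=-1))
        assign = np.zeros(X.shape[0], dtype=int)
        for n, x in enumerate(X):
            best_k = 0
            best_d = np.sqrt(np.sum((x - mu[0]) ** 2))
            for j in range(1, K):
                if center_dists[best_k, j] >= 2.0 * best_d:
                    continue          # mu_j cannot be closer than the current best mean
                d = np.sqrt(np.sum((x - mu[j]) ** 2))
                if d < best_d:
                    best_k, best_d = j, d
            assign[n] = best_k
        return assign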

On-line Stochastic Version
- Instead of batch processing the entire data set, apply the Robbins-Monro procedure to finding the roots of the regression function given by the derivative of J w.r.t. µ_k:

    \mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n \left( x_n - \mu_k^{\text{old}} \right)

- where η_n is a learning-rate parameter made to decrease monotonically as more samples are observed
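
A sequential sketch of this update (illustrative; it applies the update to the nearest mean for each incoming sample and uses η_n = 1 / (number of points seen by that cluster), one common monotonically decreasing schedule that the slide does not specify):

    def kmeans_online(X_stream, mu):
        # mu: initial K x D float array of means
        # mu_k_new = mu_k_old + eta_n * (x_n - mu_k_old)
        counts = np.zeros(mu.shape[0])
        for x in X_stream:
            k = np.argmin(np.sum((mu - x) ** 2, axis=1))   # nearest mean for this sample
            counts[k] += 1
            eta = 1.0 / counts[k]                          # decreasing learning rate
            mu[k] = mu[k] + eta * (x - mu[k])
        return mu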

Dissimilarity Measure
- Euclidean distance has limitations:
  - Inappropriate for categorical labels
  - Cluster means are not robust to outliers
- Use a more general dissimilarity measure ν(x, x') and the distortion measure

    \tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \nu(x_n, \mu_k)

- This gives the k-medoids algorithm; the M step is potentially more complex than for k-means
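
A naive k-medoids sketch under a user-supplied dissimilarity ν (illustrative only; the M step here searches over the members of each cluster for the medoid with minimum total dissimilarity, which is exactly what makes it more expensive than the k-means mean update):

    def k_medoids(X, K, nu, max_iter=50, seed=0):
        # nu: dissimilarity function nu(x, x_prime) -> float
        rng = np.random.default_rng(seed)
        medoids = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
        for _ in range(max_iter):
            # E step: assign each point to its nearest medoid under nu
            D = np.array([[nu(x, m) for m in medoids] for x in X])   # (N, K)
            assign = np.argmin(D, axis=1)
            # M step: for each cluster, pick the member with minimum total dissimilarity
            new_medoids = medoids.copy()
            for k in range(K):
                members = X[assign == k]
                if len(members) == 0:
                    continue
                costs = [sum(nu(m, x) for x in members) for m in members]
                new_medoids[k] = members[int(np.argmin(costs))]
            if np.allclose(new_medoids, medoids):
                break
            medoids = new_medoids
        return medoids, assign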

Limitation of K-means
- Every data point is assigned uniquely to one and only one cluster
- Yet a point may be equidistant from two cluster centers
- A probabilistic approach instead gives a soft assignment of data points, reflecting the level of uncertainty

Image Segmentation and Compression
- Goal: partition an image into regions, each of which has a homogeneous visual appearance or corresponds to objects or parts of objects
- Each pixel is a point in R-G-B space
- K-means clustering is used with a palette of K colors
- The method does not take into account the proximity of different pixels
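
A sketch of k-means segmentation of an image's pixels in R-G-B space, reusing the kmeans helper above (it assumes img is an H x W x 3 array; this is illustrative, not the exact code behind the slide's figures):

    def segment_image(img, K):
        H, W, _ = img.shape
        pixels = img.reshape(-1, 3).astype(float)    # each pixel is a point in R-G-B space
        mu, r = kmeans(pixels, K)                    # palette of K colors
        labels = np.argmax(r, axis=1)
        # replace every pixel by the color of its assigned cluster center
        return mu[labels].reshape(H, W, 3)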

K-means in Image Segmentation
- [Figure: two example images, each encoded with K = 2, 3, and 10 colors]

Data Compression
- Lossless data compression: able to reconstruct the data exactly from the compressed representation
- Lossy data compression: accept some error in return for greater compression
- K-means for lossy compression:
  - For each of the N data points, store only the identity k of the cluster center to which it is assigned
  - Store the values of the cluster centers µ_k, where K << N
  - The vectors µ_k are called code-book vectors; the method is called vector quantization
- Data compression achieved:
  - The original image needs 24N bits (R, G, B need 8 bits each)
  - The compressed image needs 24K + N log2 K bits
  - For K = 2, 3 and 10, the compression ratios are about 4%, 8% and 17%
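
A small worked check of these bit counts (assuming, for illustration, a 240 x 180 image, i.e. N = 43,200 pixels, and ceil(log2 K) bits per pixel label; the image size is not given on the slide):

    import math

    def compression_ratio(N, K):
        original = 24 * N                                   # 8 bits each for R, G, B
        compressed = 24 * K + N * math.ceil(math.log2(K))   # code-book plus per-pixel labels
        return compressed / original

    # compression_ratio(43200, 2)  -> about 0.042  (4%)
    # compression_ratio(43200, 3)  -> about 0.083  (8%)
    # compression_ratio(43200, 10) -> about 0.167  (17%)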

Data Set: Handwriting Style Features
- Muehlberger et al., Journal of Forensic Sciences, 1975
- The six multinomial features take 4, 5, 3, 4, 4 and 5 values, so

    Val(X) = 4 x 5 x 3 x 4 x 4 x 5 = 4,800

- The full joint distribution P(R, L, A, C, B, T) therefore has 4,800 - 1 = 4,799 free parameters

Clustering of handwriting styles
- The letter pair "th" is characterized by the six multinomial features above